本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新521篇论文,其中:

  • 自然语言处理71
  • 信息检索20
  • 计算机视觉61

自然语言处理

1. 【2602.17664】Sink-Aware Pruning for Diffusion Language Models

链接https://arxiv.org/abs/2602.17664

作者:Aidar Myrzakhan,Tianyi Li,Bowei Guo,Shengkun Tang,Zhiqiang Shen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Diffusion Language Models, Diffusion Language, incur high inference, high inference cost, inference cost due

备注: Code at: [this https URL](https://github.com/VILA-Lab/Sink-Aware-Pruning)

点击查看摘要

Abstract:Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose ${\bf \texttt{Sink-Aware Pruning}}$, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at this https URL.

2. 【2602.17663】CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

链接https://arxiv.org/abs/2602.17663

作者:Juri Opitz,Corina Raclé,Emanuela Boros,Andrianos Michail,Matteo Romanello,Maud Ehrmann,Simon Clematide

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:multilingual historical texts, CLEF evaluation lab, person-place relation extraction, dedicated to person-place, CLEF evaluation

备注: ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at [this https URL](https://hipe-eval.github.io/HIPE-2026/)

点击查看摘要

Abstract:HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person--place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?") - requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.

3. 【2602.17655】What Language is This? Ask Your Tokenizer

链接https://arxiv.org/abs/2602.17655

作者:Clara Meister,Ahmetcan Yavuz,Pietro Lesci,Tiago Pimentel

类目:Computation and Language (cs.CL)

关键词:facilitates corpus curation, training data analysis, multilingual natural language, natural language processing, corpus curation

备注

点击查看摘要

Abstract:Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.

4. 【2602.17653】Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

链接https://arxiv.org/abs/2602.17653

作者:Iskar Deng,Nathalia Xu,Shane Steinert-Threlkeld

类目:Computation and Language (cs.CL)

关键词:resemble cross-linguistic regularities, Recent work, word order, work has shown, resemble cross-linguistic

备注: 15 pages, 7 figures, 7 tables. Under review

点击查看摘要

Abstract:Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.

5. 【2602.17645】Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

链接https://arxiv.org/abs/2602.17645

作者:Xiaohan Zhao,Zhaoyi Li,Yaxin Luo,Jiacheng Cui,Zhiqiang Shen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, complex multimodal boundaries, Vision-Language Models, Large Vision-Language, multimodal boundaries

备注: Code at: [this https URL](https://github.com/vila-lab/M-Attack-V2)

点击查看摘要

Abstract:Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: this https URL.

6. 【2602.17623】Unmasking the Factual-Conceptual Gap in Persian Language Models

链接https://arxiv.org/abs/2602.17623

作者:Alireza Sakhaeirad,Ali Ma'manpoosh,Arshia Hemmat

类目:Computation and Language (cs.CL)

关键词:emerging Persian NLP, implicit social norms, Persian NLP benchmarks, memorized cultural facts, Persian NLP

备注

点击查看摘要

Abstract:While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs, arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21\% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.

7. 【2602.17598】he Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

链接https://arxiv.org/abs/2602.17598

作者:Jayadev Billa

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

关键词:perform implicit ASR, largely perform implicit, Current speech LLMs, LLMs largely perform, implicit ASR

备注: 10 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.

8. 【2602.17588】Modeling Distinct Human Interaction in Web Agents

链接https://arxiv.org/abs/2602.17588

作者:Faria Huq,Zora Zhiruo Wang,Zhanqiu Guo,Venu Arvind Arangarajan,Tianyue Ou,Frank Xu,Shuyan Zhou,Graham Neubig,Jeffrey P. Bigham

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:involvement remains essential, human involvement remains, correcting agent behavior, rapid progress, progress in autonomous

备注: Preprint

点击查看摘要

Abstract:Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

9. 【2602.17547】KLong: Training LLM Agent for Extremely Long-horizon Tasks

链接https://arxiv.org/abs/2602.17547

作者:Yue Liu,Zhiyuan Hu,Flood Sung,Jiaheng Zhang,Bryan Hooi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:open-source LLM agent, LLM agent trained, open-source LLM, LLM agent, agent trained

备注

点击查看摘要

Abstract:This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

10. 【2602.17546】Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

链接https://arxiv.org/abs/2602.17546

作者:Jyotin Goel,Souvik Maji,Pratik Mazumder

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Instruction-following language models, Instruction-following language, behavior can deteriorate, deteriorate under benign, worsen under adversarial

备注: Work in progress (30 pages)

点击查看摘要

Abstract:Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.

11. 【2602.17544】Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

链接https://arxiv.org/abs/2602.17544

作者:Shashank Aggarwal,Ram Vikas Mishra,Amit Awekar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:LLM-based agents exchange, agents exchange intermediate, search and ranking, LLM-based agents, exchange intermediate reasoning

备注

点击查看摘要

Abstract:In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.

12. 【2602.17542】Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

链接https://arxiv.org/abs/2602.17542

作者:Zhangqi Duan,Arnav Kankaria,Dhruv Kartik,Andrew Lan

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Fine-grained skill representations, Fine-grained skill, skill representations, commonly referred, knowledge components

备注

点击查看摘要

Abstract:Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.

13. 【2602.17526】he Anxiety of Influence: Bloom Filters in Transformer Attention Heads

链接https://arxiv.org/abs/2602.17526

作者:Peter Balogh

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:transformer attention heads, Bloom filter, false positive, answering the question, false positive rates

备注: 13 pages, 8 figures, code at [this https URL](https://github.com/pbalogh/anxiety-of-influence) v2: L3H0 reclassified as prefix-attention head following confound control. Capacity analysis updated. Duplicate-token head overlap experiment added v3: All experiments were independently validated on CPU to rule out hardware-specific computation artifacts. Results are consistent across backends

点击查看摘要

Abstract:Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spectrum of membership-testing strategies. Two heads (L0H1 and L0H5 in GPT-2 small) function as high-precision membership filters with false positive rates of 0-4\% even at 180 unique context tokens -- well above the $d_\text{head} = 64$ bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula $p \approx (1 - e^{-kn/m})^k$ with $R^2 = 1.0$ and fitted capacity $m \approx 5$ bits, saturating by $n \approx 20$ unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix-attention head after confound controls revealed its apparent capacity curve was a sequence-length artifact. Together, the three genuine membership-testing heads form a multi-resolution system concentrated in early layers (0-1), taxonomically distinct from induction and previous-token heads, with false positive rates that decay monotonically with embedding distance -- consistent with distance-sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43\% higher generalization than duplicate-token-only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.

14. 【2602.17513】Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

链接https://arxiv.org/abs/2602.17513

作者:Baris Karacan,Barbara Di Eugenio,Patrick Thornton

类目:Computation and Language (cs.CL)

关键词:vital patient information, patient information, vital patient, Clinical free-text notes, Clinical

备注: 11 pages. Accepted at LREC 2026. To appear in the proceedings

点击查看摘要

Abstract:Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

15. 【2602.17483】What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data

链接https://arxiv.org/abs/2602.17483

作者:Dimitri Staufer,Kirsten Morehouse

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:conversational agents based, Large language models, Large language, conversational agents, agents based

备注

点击查看摘要

Abstract:Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions. Prior work shows that PD can resurface, yet users lack insight into how strongly models associate specific information to their identity. We audit PD across eight LLMs (3 open-source; 5 API-based, including GPT-4o), introduce LMP2 (Language Model Privacy Probe), a human-centered, privacy-preserving audit tool refined through two formative studies (N=20), and run two studies with EU residents to capture (i) intuitions about LLM-generated PD (N1=155) and (ii) reactions to tool output (N2=303). We show empirically that models confidently generate multiple PD categories for well-known individuals. For everyday users, GPT-4o generates 11 features with 60% or more accuracy (e.g., gender, hair color, languages). Finally, 72% of participants sought control over model-generated associations with their name, raising questions about what counts as PD and whether data privacy rights should extend to LLMs.

16. 【2602.17475】Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

链接https://arxiv.org/abs/2602.17475

作者:Pietro Ferrazzi,Mattia Franzin,Alberto Lavelli,Bernardo Magnini

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, Large Language Models, medical Natural Language, Large Language, Language Processing

备注: Paper Accepted at LREC 2026

点击查看摘要

Abstract:Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families-Llama-3, Gemma-3, and Qwen3-across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

17. 【2602.17469】Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers

链接https://arxiv.org/abs/2602.17469

作者:Nusrat Jahan Lia,Shubhashis Roy Dipta

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:accurately understand human, systems accurately understand, understand human intent, understand human, core theme

备注

点击查看摘要

Abstract:The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that compressed model (mDistilBERT) exhibits 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including "Asymmetric Empathy" where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a "Modern Bias" in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate "Affective Stability" metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.

18. 【2602.17467】PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions

链接https://arxiv.org/abs/2602.17467

作者:Greta Damo,Stéphane Petiot,Elena Cabrio,Serena Villata

类目:Computation and Language (cs.CL)

关键词:online platforms poses, platforms poses significant, poses significant societal, significant societal challenges, Natural Language Processing

备注

点击查看摘要

Abstract:The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground HS explanations into evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.

19. 【2602.17465】Entropy-Based Data Selection for Language Models

链接https://arxiv.org/abs/2602.17465

作者:Hongming Li,Yang Liu,Chao Huang

类目:Computation and Language (cs.CL)

关键词:Modern language models, Modern language, data, Data selection, Modern

备注: IEEE Access, 15 pages, 5 figures, 11 tables

点击查看摘要

Abstract:Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely related to computational resources, which always require a high compute budget. Owing to the resource limitations in practical fine-tuning scenario, we systematically reveal the relationship between data selection and uncertainty estimation of selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (QA) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in the compute-constrained scenarios.

20. 【2602.17445】ABCD: All Biases Come Disguised

链接https://arxiv.org/abs/2602.17445

作者:Mateusz Nowak,Xavier Cadet,Peter Chin

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:measuring LLMs' ability, Multiple-choice question, answer knowledge-based questions, practice for measuring, measuring LLMs'

备注: 29 pages, 20 figures, pre-print, 12 tables

点击查看摘要

Abstract:Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.

21. 【2602.17443】AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

链接https://arxiv.org/abs/2602.17443

作者:Adib Sakhawat,Fardeen Sadab,Rakin Shahriar

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, capabilities of Large, multi-turn interactions, Language Models

备注: 16 pages, 5 figures, 13 tables. Includes appendix and supplementary materials

点击查看摘要

Abstract:Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense;(Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.

22. 【2602.17431】Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

链接https://arxiv.org/abs/2602.17431

作者:Dylan Bouchard,Mohit Singh Chauhan,Viren Bajaj,David Skarbrevik

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:closed-book hallucination detection, Uncertainty quantification, fine-grained uncertainty quantification, approach to closed-book, closed-book hallucination

备注: UQLM repository: [this https URL](https://github.com/cvs-health/uqlm)

点击查看摘要

Abstract:Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.

23. 【2602.17425】Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

链接https://arxiv.org/abs/2602.17425

作者:Sanjeev Kumar,Preethi Jyothi,Pushpak Bhattacharyya

类目:Computation and Language (cs.CL)

关键词:Evaluating machine translation, scenarios poses unique, poses unique challenges, Evaluating machine, extremely low-resource language

备注: 6 pages

点击查看摘要

Abstract:Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (\textit{matra}) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.

24. 【2602.17424】Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

链接https://arxiv.org/abs/2602.17424

作者:Anastasia Zhukova,Felix Hamborg,Karsten Donnay,Norman Meuschke,Bela Gipp

类目:Computation and Language (cs.CL)

关键词:Cross-document coreference resolution, enabling content analysis, Cross-document coreference, identifies and links, related documents

备注

点击查看摘要

Abstract:Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.

25. 【2602.17413】DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing

链接https://arxiv.org/abs/2602.17413

作者:René Brinkhege,Prahlad Menon

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:current inter-organizational data, asset level, shared or withheld, current inter-organizational, inter-organizational data spaces

备注

点击查看摘要

Abstract:In current inter-organizational data spaces, usage policies are enforced mainly at the asset level: a whole document or dataset is either shared or withheld. When only parts of a document are sensitive, providers who want to avoid leaking protected information typically must manually redact documents before sharing them, which is costly, coarse-grained, and hard to maintain as policies or partners change. We present DAVE, a usage policy-enforcing LLM spokesperson that answers questions over private documents on behalf of a data provider. Instead of releasing documents, the provider exposes a natural language interface whose responses are constrained by machine-readable usage policies. We formalize policy-violating information disclosure in this setting, drawing on usage control and information flow security, and introduce virtual redaction: suppressing sensitive information at query time without modifying source documents. We describe an architecture for integrating such a spokesperson with Eclipse Dataspace Components and ODRL-style policies, and outline an initial provider-side integration prototype in which QA requests are routed through a spokesperson service instead of triggering raw document transfer. Our contribution is primarily architectural: we do not yet implement or empirically evaluate the full enforcement pipeline. We therefore outline an evaluation methodology to assess security, utility, and performance trade-offs under benign and adversarial querying as a basis for future empirical work on systematically governed LLM access to multi-party data spaces.

26. 【2602.17377】he Role of the Availability Heuristic in Multiple-Choice Answering Behaviour

链接https://arxiv.org/abs/2602.17377

作者:Leonidas Zotos,Hedderik van Rijn,Malvina Nissim

类目:Computation and Language (cs.CL)

关键词:MCQ options, guessing is common, common practice, MCQ options operationalised, MCQ

备注: 15 pages, 4 figures

点击查看摘要

Abstract:When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts' prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs' frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.

27. 【2602.17366】RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

链接https://arxiv.org/abs/2602.17366

作者:Yiming Zhang,Siyue Zhang,Junbo Zhao,Chen Zhao

类目:Computation and Language (cs.CL)

关键词:question answering presents, answering presents significant, presents significant challenges, large language models, Long-tail question answering

备注

点击查看摘要

Abstract:Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriver, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism to dynamically route queries to specialized retrieval modules to further improve retrieval performance.

28. 【2602.17327】WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

链接https://arxiv.org/abs/2602.17327

作者:Michael Dinzinger,Laura Caspari,Ali Salman,Irvin Topi,Jelena Mitrović,Michael Granitzer

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:million FAQ-based natural, FAQ-based natural question-answer, natural question-answer pairs, natural question-answer, million FAQ-based

备注

点击查看摘要

Abstract:We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss, and Knowledge Distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort. Since late 2025, structured FAQs are being regularly released through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset itself and all related resources are publicly available on GitHub and HuggingFace.

29. 【2602.17316】Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

链接https://arxiv.org/abs/2602.17316

作者:Bogdan Kostić,Conor Fallon,Julian Risch,Alexander Löser

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, advancement of Large, established standardized evaluation, Language Models

备注: Accepted at LREC 2026

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.

30. 【2602.17288】ArXiv-to-Model: A Practical Study of Scientific LM Training

链接https://arxiv.org/abs/2602.17288

作者:Anuj Gupta

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:demonstrate strong reasoning, frontier large language, sources remains under-documented, scientific language model, models demonstrate strong

备注: 15 pages, 6 figures, 1 table

点击查看摘要

Abstract:While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.

31. 【2602.17287】Representation Collapse in Machine Translation Through the Lens of Angular Dispersion

链接https://arxiv.org/abs/2602.17287

作者:Evgeniia Tokarchuk,Maya K. Nachesa,Sergey Troshin,Vlad Niculae

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Modern neural translation, Modern neural, high performance, high-resource datasets, trained on high-resource

备注

点击查看摘要

Abstract:Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.

32. 【2602.17283】owards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

链接https://arxiv.org/abs/2602.17283

作者:Yukun Chen,Xinyu Zhang,Jialong Tang,Yu Wan,Baosong Yang,Yiming Li,Zhan Qin,Kui Ren

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:detecting explicit harms, paradigms primarily focus, large language models, evaluation paradigms primarily, explicit harms

备注

点击查看摘要

Abstract:While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz's Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment ($Acc 77\%$), with significant performance disparities across different languages ($\Delta Acc 20\%$). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: this https URL.

33. 【2602.17262】Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

链接https://arxiv.org/abs/2602.17262

作者:Kensuke Okada,Yui Furukawa,Kyosuke Bunji

类目:Computation and Language (cs.CL); Methodology (stat.ME)

关键词:large language models, audit large language, language models, bias assessments, consistency to safety

备注

点击查看摘要

Abstract:Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers-a form of socially desirable responding (SDR)-biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.

34. 【2602.17229】Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy

链接https://arxiv.org/abs/2602.17229

作者:Bianca Raimondi,Maurizio Gabbrielli

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Language Models necessitates, surface-level performance metrics, Large Language, transcend surface-level performance

备注: Preprint. Under review

点击查看摘要

Abstract:The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom's Taxonomy as a hierarchical lens. By analyzing high-dimensional activation vectors from different LLMs, we probe whether different cognitive levels, ranging from basic recall (Remember) to abstract synthesis (Create), are linearly separable within the model's residual streams. Our results demonstrate that linear classifiers achieve approximately 95% mean accuracy across all Bloom levels, providing strong evidence that cognitive level is encoded in a linearly accessible subspace of the model's representations. These findings provide evidence that the model resolves the cognitive difficulty of a prompt early in the forward pass, with representations becoming increasingly separable across layers.

35. 【2602.17221】From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences

链接https://arxiv.org/abs/2602.17221

作者:Yi-Chih Huang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:reshaping knowledge work, limited methodological exploration, existing research focuses, research focuses predominantly, humanities and social

备注: also in Chinese

点击查看摘要

Abstract:Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a "methodological experiment," this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan's this http URL usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index (AEI) serves as the empirical vehicle for validating the feasibility of this methodology. This study operates on two levels: the primary level is the design and validation of a methodological framework - a seven-stage modular workflow grounded in three principles: task modularization, human-AI division of labor, and verifiability, with each stage delineating clear roles for human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation); the secondary level is the empirical analysis of AEI Taiwan data - serving as an operational demonstration of the workflow's application to secondary data research, showcasing both the process and output quality (see Appendix A). This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human-AI collaboration - direct execution, iterative refinement, and human-led - through reflexive documentation of the operational process. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Limitations including single-platform data, cross-sectional design, and AI reliability risks are acknowledged.

Comments:
also in Chinese

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Cite as:
arXiv:2602.17221 [cs.AI]

(or
arXiv:2602.17221v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2602.17221

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
36. 【2602.17194】What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform

链接https://arxiv.org/abs/2602.17194

作者:Adrian Cosma,Cosmin Dumitrache,Emilian Radoi

类目:Computation and Language (cs.CL)

关键词:deliver medical advice, mode of care, effectively in writing, Romanian text-based telemedicine, common mode

备注

点击查看摘要

Abstract:Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question--doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.

37. 【2602.17127】he Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

链接https://arxiv.org/abs/2602.17127

作者:Dusan Bosnjakovic

类目:Computation and Language (cs.CL)

关键词:standalone chat interfaces, foundational reasoning layers, Large Language Models, provider-level behavioral signatures, recursive evaluation loops

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the ``prevailing mindsets'' embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory -- specifically latent trait estimation under ordinal uncertainty -- to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent ``lab signal'' accounts for significant behavioral clustering. These findings demonstrate that in ``locked-in'' provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.17127 [cs.CL]

(or
arXiv:2602.17127v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.17127

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
38. 【2602.17108】Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests

链接https://arxiv.org/abs/2602.17108

作者:Anton Dzega,Aviad Elyashar,Ortal Slobodin,Odeya Cohen,Rami Puzis

类目:Computation and Language (cs.CL)

关键词:Thematic Apperception Test, Thematic Apperception, Apperception Test, Object Relations Scale, psychometrically grounded

备注

点击查看摘要

Abstract:Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), who assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.

39. 【2602.17072】BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

链接https://arxiv.org/abs/2602.17072

作者:Yunseung Lee,Subin Kim,Youngjun Kwak,Jaegul Choo

类目:Computation and Language (cs.CL)

关键词:Large language models, handle customer inquiries, Large language, based chatbots, chatbots are increasingly

备注

点击查看摘要

Abstract:Large language models (LLMs)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations-including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors-misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. However, such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty-basic, intermediate, and advanced-corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.

40. 【2602.17063】Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

链接https://arxiv.org/abs/2602.17063

作者:Akira Sakai,Yuma Ichikawa

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Sub-bit model compression, model compression seeks, compression seeks storage, Sub-bit model, aggressively compressed

备注

点击查看摘要

Abstract:Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.

41. 【2602.17054】ALPS: A Diagnostic Challenge Set for Arabic Linguistic Pragmatic Reasoning

链接https://arxiv.org/abs/2602.17054

作者:Hussein S. Al-Olimat,Ahmad Alshareef

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:recent Arabic NLP, Arabic NLP benchmarks, NLP benchmarks focus, deeper linguistic verification, Arabic NLP

备注

点击查看摘要

Abstract:While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.

42. 【2602.17053】RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

链接https://arxiv.org/abs/2602.17053

作者:Yunseok Han,Yejoon Lee,Jaeyoung Do

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:exhibit strong performance, Large Reasoning Models, true decision process, Large Reasoning, exhibit strong

备注: Accepted in ICLR 2026 Poster: $\href{ [this https URL](https://iclr.cc/virtual/2026/poster/10011763) }{\text{this URL}}$

点击查看摘要

Abstract:Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at project page: $\href{this https URL}{this https URL}$

43. 【2602.17051】Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

链接https://arxiv.org/abs/2602.17051

作者:Deepak Uniyal,Md Abul Bashar,Richi Nayak

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:public debates span, natural language processing, media discourse remains, Analysing multilingual social, large-scale public debates

备注

点击查看摘要

Abstract:Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.

44. 【2602.17045】Large Language Models Persuade Without Planning Theory of Mind

链接https://arxiv.org/abs/2602.17045

作者:Jared Moore,Rasmus Overmark,Ned Cooper,Beba Cibralic,Nick Haber,Cameron R. Jones

类目:Computation and Language (cs.CL)

关键词:large language models, theory of mind, language models, growing body, large language

备注

点击查看摘要

Abstract:A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader's sensitivity to a given target's knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets' real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs' potential to influence people's beliefs and behavior.

45. 【2602.17022】ReIn: Conversational Error Recovery with Reasoning Inception

链接https://arxiv.org/abs/2602.17022

作者:Takyoung Kim,Jinseok Nam,Chandrayee Basu,Xing Fan,Chengyuan Ma,Heng Ji,Gokhan Tur,Dilek Hakkani-Tür

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:integration achieve strong, achieve strong performance, fixed task-oriented dialogue, task-oriented dialogue datasets, large language models

备注: ICLR 2026

点击查看摘要

Abstract:Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user's ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.

46. 【2602.17004】Arcee Trinity Large Technical Report

链接https://arxiv.org/abs/2602.17004

作者:Varun Singh,Lucas Krauss,Sami Jaghouar,Matej Sirovatka,Charles Goddard,Fares Obied,Jack Min Ong,Jannik Straube,Fern,Aria Harley,Conner Stewart,Colin Kealty,Maziyar Panahi,Simon Kirsten,Anushka Deshpande,Anneketh Vij,Arthur Bresnu,Pranav Veldurthi,Raghav Ravishankar,Hardik Bishnoi,DatologyAI Team,Arcee AI Team,Prime Intellect Team,Mark McQuade,Johannes Hagemann,Lucas Atkins

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Arcee Trinity Large, Arcee Trinity, total parameters, Trinity Nano, Trinity Large

备注

点击查看摘要

Abstract:We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini, with Trinity Nano having 6B total parameters with 1B activated per token, Trinity Mini having 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at this https URL.

47. 【2602.17003】Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

链接https://arxiv.org/abs/2602.17003

作者:Serin Kim,Sangam Lee,Dongha Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, lack personalization capabilities, current agents lack, advanced web agents

备注

点击查看摘要

Abstract:Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at this https URL.

48. 【2602.17001】Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

链接https://arxiv.org/abs/2602.17001

作者:Zhao Tan,Yiji Zhao,Shiyu Wang,Chang Xu,Yuxuan Liang,Xiping Liu,Shirui Pan,Ming Jin

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)

关键词:Natural Language Querying, Time Series Databases, retrieve meaningful events, Natural Language, assist non-expert users

备注

点击查看摘要

Abstract:Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

49. 【2602.16997】Exploring LLMs for User Story Extraction from Mockups

链接https://arxiv.org/abs/2602.16997

作者:Diego Firmenich,Leandro Antonelli,Bruno Pazos,Fabricio Lozada,Leonardo Morales

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:define functional requirements, User stories, widely used artifacts, software industry, industry to define

备注: 14 pages, 6 figures. Preprint of the paper published in the 28th Workshop on Requirements Engineering (WER 2025)

点击查看摘要

Abstract:User stories are one of the most widely used artifacts in the software industry to define functional requirements. In parallel, the use of high-fidelity mockups facilitates end-user participation in defining their needs. In this work, we explore how combining these techniques with large language models (LLMs) enables agile and automated generation of user stories from mockups. To this end, we present a case study that analyzes the ability of LLMs to extract user stories from high-fidelity mockups, both with and without the inclusion of a glossary of the Language Extended Lexicon (LEL) in the prompts. Our results demonstrate that incorporating the LEL significantly enhances the accuracy and suitability of the generated user stories. This approach represents a step forward in the integration of AI into requirements engineering, with the potential to improve communication between users and developers.

50. 【2602.16979】Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

链接https://arxiv.org/abs/2602.16979

作者:Divyam Madaan,Sumit Chopra,Kyunghyun Cho

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, Large Language, existing approaches predominantly, approaches predominantly assume

备注

点击查看摘要

Abstract:Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.

51. 【2602.16976】HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing

链接https://arxiv.org/abs/2602.16976

作者:Srikumar Nayak

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Financial risk systems, formatting issues fixed, Financial risk, issues fixed, two-step routine

备注: 11 pages, 1 fig , 4 tables

点击查看摘要

Abstract:Here's the corrected paragraph with all punctuation and formatting issues fixed: Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance. In practice, this split can break under real constraints. The prediction model may look good, but the final decision can be unstable when the market shifts, when discrete constraints are added (lot sizes, caps), or when the optimization becomes slow for larger asset sets. Also, regulated settings need a clear audit trail that links each decision to the exact model state and inputs. We present HQFS, a practical hybrid pipeline that connects forecasting, discrete risk optimization, and auditability in one flow. First, HQFS learns next-step return and a volatility proxy using a variational quantum circuit (VQC) with a small classical head. Second, HQFS converts the risk-return objective and constraints into a QUBO and solves it with quantum annealing when available, while keeping a compatible classical QUBO solver as a fallback for deployment. Third, HQFS signs each rebalance output using a post-quantum signature so the allocation can be verified later without trusting the runtime environment. On our market dataset study, HQFS reduces return prediction error by 7.8% and volatility prediction error by 6.1% versus a tuned classical baseline. For the decision layer, HQFS improves out-of-sample Sharpe by 9.4% and lowers maximum drawdown by 11.7%. The QUBO solve stage also cuts average solve time by 28% compared to a mixed-integer baseline under the same constraints, while producing fully traceable, signed allocation records.

Comments:
11 pages, 1 fig , 4 tables

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.16976 [cs.AI]

(or
arXiv:2602.16976v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2602.16976

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
52. 【2602.16959】Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry

链接https://arxiv.org/abs/2602.16959

作者:Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Classical Persian poetry, Classical Persian, historically sustained archive, Persian poetry, intertextual convention

备注

点击查看摘要

Abstract:Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet $\times$ Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen--Shannon divergence and Kullback--Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61{,}573 verses across 10 poets, 22.2\% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes. The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.

53. 【2602.16957】When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English

链接https://arxiv.org/abs/2602.16957

作者:Hasan Can Biyik,Libby Barak,Jing Peng,Anna Feldman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:socially sensitive expressions, substitute socially sensitive, context complicates modeling, Euphemisms substitute socially, pragmatic context complicates

备注

点击查看摘要

Abstract:Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.

54. 【2602.16938】ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

链接https://arxiv.org/abs/2602.16938

作者:Ofer Meshi,Krisztian Balog,Sally Goldman,Avi Caciularu,Guy Tennenholtz,Jihwan Jeong,Amir Globerson,Craig Boutilier

类目:Computation and Language (cs.CL)

关键词:leading to systems, simulated interactions, real world, promise of LLM-based, improve conversational

备注: EACL 2026

点击查看摘要

Abstract:The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

55. 【2602.16922】A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD, AES, and Bio-inspired Mechanisms

链接https://arxiv.org/abs/2602.16922

作者:Md. Ismiel Hossen Abir

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:factoring large numbers, Pollard Rho, significant risk, difficulty of factoring, Trial Division

备注

点击查看摘要

Abstract:Quantum computing is a significant risk to classical cryptographic, especially RSA, which depends on the difficulty of factoring large numbers. Classical factorization methods, such as Trial Division and Pollard's Rho, are inefficient for large keys, while Shor's quantum algorithm can break RSA efficiently in polynomial time. This research studies RSA's vulnerabilities under both classical and quantum attacks and designs a hybrid security framework to ensure data protection in the post-quantum era. The conceptual framework combines AES encryption for classical security, BB84 Quantum Key Distribution (QKD) for secure key exchange with eavesdropping detection, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection. RSA is vulnerable to Shor's algorithm, BB84 achieves full key agreement in ideal conditions, and it detects eavesdropping with high accuracy. The conceptual model includes both classical and quantum security methods, providing a scalable and adaptive solution for Post-Quantum encryption data protection. This work primarily proposes a conceptual framework. Detailed implementation, security proofs, and extensive experimental validation are considered future work.

56. 【2602.16855】Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

链接https://arxiv.org/abs/2602.16855

作者:Haiyang Xu,Xi Zhang,Haowei Liu,Junyang Wang,Zhaozai Zhu,Shengjie Zhou,Xuhao Hu,Feiyu Gao,Junjie Cao,Zihua Wang,Zhiyuan Chen,Jitong Liao,Qi Zheng,Jiahui Zeng,Ze Xu,Shuai Bai,Junyang Lin,Jingren Zhou,Ming Yan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:latest native GUI, enable cloud-edge collaboration, native GUI agent, GUI agent model, GUI automation tasks

备注: 25 pages, 11 figures, 11 tables

点击查看摘要

Abstract:The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results on more than 20+ GUI benchmarks on open-source models: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool-calling tasks, it obtains 47.6 on OSWorld-MCP, and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI-Knowledge Bench. GUI-Owl-1.5 incorporates several key innovations: (1) Hybird Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud-based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and multi-agent adaptation; (3) Multi-platform Environment RL Scaling: We propose a new environment RL algorithm, MRPO, to address the challenges of multi-platform conflicts and the low training efficiency of long-horizon tasks. The GUI-Owl-1.5 models are open-sourced, and an online cloud-sandbox demo is available at this https URL.

57. 【2602.16852】Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

链接https://arxiv.org/abs/2602.16852

作者:Minh Duc Bui,Manuel Mager,Peter Herbert Kann,Katharina von der Wense

类目:Computation and Language (cs.CL)

关键词:Mainz carnival, German dialects, yearly celebration, German city, Meenzerisch

备注: Accepted at LREC 2026

点击查看摘要

Abstract:Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out-a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary-an NLP-ready dataset derived from an existing resource (Schramm, 1966)-to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

58. 【2602.16843】BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

链接https://arxiv.org/abs/2602.16843

作者:Ahmed Rafid,Rumman Adib,Fariya Ahmed,Ajwad Abrar,Mohammed Saidul Islam

类目:Computation and Language (cs.CL)

关键词:Evaluating factual consistency, reliable text summarization, Evaluating factual, text summarization, factual consistency

备注: Accepted in 2nd LoResLM at EACL 2026

点击查看摘要

Abstract:Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.

59. 【2602.16839】raining Large Reasoning Models Efficiently via Progressive Thought Encoding

链接https://arxiv.org/abs/2602.16839

作者:Zeliang Zhang,Xiaodong Liu,Hao Cheng,Hao Sun,Chenliang Xu,Jianfeng Gao

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:autoregressive decoding dominates, decoding dominates time, Large reasoning models, Progressive Thought Encoding, reinforcement learning

备注: ICLR 2026, 15 pages

点击查看摘要

Abstract:Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.

60. 【2602.16836】Claim Automation using Large Language Model

链接https://arxiv.org/abs/2602.16836

作者:Zhengda Mo,Zhiyu Quan,Eli O'Donohue,Kaiwen Zhong

类目:Computation and Language (cs.CL)

关键词:achieved strong performance, Large Language Models, Large Language, remains limited, general-purpose language tasks

备注: 46 pages, 12 figures. Code and data processing pipeline described

点击查看摘要

Abstract:While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.

61. 【2602.16832】IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

链接https://arxiv.org/abs/2602.16832

作者:Priyaranjan Pattnayak,Sanchari Chowdhuri

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:leaving multilingual vulnerabilities, multilingual vulnerabilities understudied, Indic Jailbreak Robustness, large language models, South Asian languages

备注: Accepted in EACL Industry Track Oral, 2026

点击查看摘要

Abstract:Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied. We introduce \textbf{Indic Jailbreak Robustness (IJR)}, a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 Billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach 1.0 with refusals collapsing. (2) English to Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approx 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.

Comments:
Accepted in EACL Industry Track Oral, 2026

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2602.16832 [cs.AI]

(or
arXiv:2602.16832v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2602.16832

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
62. 【2602.16819】Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

链接https://arxiv.org/abs/2602.16819

作者:Yiqing Xie,Emmy Liu,Gaokai Zhang,Nachiket Kotalwar,Shubham Gandhi,Sathwik Acharya,Xingyao Wang,Carolyn Rose,Graham Neubig,Daniel Fried

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:predominant benchmarks focus, solving single issues, predominant benchmarks, issues on GitHub, assessing the quality

备注

点击查看摘要

Abstract:When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks that are not present in training, improving a base model by 25.4% absolute gain on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code available at: this https URL.

63. 【2602.16813】One-step Language Modeling via Continuous Denoising

链接https://arxiv.org/abs/2602.16813

作者:Chanhyuk Lee,Jaehoon Yoo,Manan Agarwal,Sheel Shah,Jerry Huang,Aditi Raghunathan,Seunghoon Hong,Nicholas M. Boffi,Jinwoo Kim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:attracted widespread interest, provide faster generation, attracted widespread, widespread interest, potential to provide

备注: 39 pages, 17 figures

点击查看摘要

Abstract:Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at this https URL.

64. 【2602.16811】Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

链接https://arxiv.org/abs/2602.16811

作者:Charalampos Mastrokostas,Nikolaos Giarelis,Nikos Karacapilidis

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Natural Language Processing, including Question Answering, Processing and Deep, Large Language Models, development of Large

备注

点击查看摘要

Abstract:Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.

65. 【2602.16802】References Improve LLM Alignment in Non-Verifiable Domains

链接https://arxiv.org/abs/2602.16802

作者:Kejian Shi,Yixin Liu,Peifeng Wang,Alexander R. Fabbri,Shafiq Joty,Arman Cohan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Reinforcement Learning, Learning with Verifiable, lacking ground-truth verifiers, Verifiable Rewards, domains lacking ground-truth

备注: ICLR 2026 Camera Ready

点击查看摘要

Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.

66. 【2602.16787】Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency

链接https://arxiv.org/abs/2602.16787

作者:Victoria Lin,Xinnuo Xu,Rachel Lawrence,Risa Ueno,Amit Sharma,Javier Gonzalez,Niranjani Prasad

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:large language models, large language, suggesting weaknesses, proven brittle, brittle when presented

备注

点击查看摘要

Abstract:Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals is limited. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.

67. 【2602.16784】Omitted Variable Bias in Language Models Under Distribution Shift

链接https://arxiv.org/abs/2602.16784

作者:Victoria Lin,Louis-Philippe Morency,Eli Ben-Michael

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)

关键词:exhibiting brittle behavior, models remain susceptible, modern language models, language models remain, language models

备注

点击查看摘要

Abstract:Despite their impressive performance on a wide variety of tasks, modern language models remain susceptible to distribution shifts, exhibiting brittle behavior when evaluated on data that differs in distribution from their training data. In this paper, we describe how distribution shifts in language models can be separated into observable and unobservable components, and we discuss how established approaches for dealing with distribution shift address only the former. Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models. To address this challenge, we introduce a framework that maps the strength of the omitted variables to bounds on the worst-case generalization performance of language models under distribution shift. In empirical experiments, we show that using these bounds directly in language model evaluation and optimization provides more principled measures of out-of-distribution performance, improves true out-of-distribution performance relative to standard distribution shift adjustment methods, and further enables inference about the strength of the omitted variables when target distribution labels are available.

68. 【2602.16729】Intent Laundering: AI Safety Datasets Are Not What They Seem

链接https://arxiv.org/abs/2602.16729

作者:Shahriar Golchin,Marc Wetter

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:quality of widely, triggering cues, datasets, cues, safety

备注: v1 preprint

点击查看摘要

Abstract:We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues. In fact, once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.

69. 【2602.16715】Retrieval Augmented (Knowledge Graph), and Large Language Model-Driven Design Structure Matrix (DSM) Generation of Cyber-Physical Systems

链接https://arxiv.org/abs/2602.16715

作者:H. Sinan Bank,Daniel R. Herber

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)

关键词:Large Language Models, Design Structure Matrices, generating Design Structure, Graph-based RAG, Language Models

备注: 26 pages, 10 figures

点击查看摘要

Abstract:We explore the potential of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Graph-based RAG (GraphRAG) for generating Design Structure Matrices (DSMs). We test these methods on two distinct use cases -- a power screwdriver and a CubeSat with known architectural references -- evaluating their performance on two key tasks: determining relationships between predefined components, and the more complex challenge of identifying components and their subsequent relationships. We measure the performance by assessing each element of the DSM and overall architecture. Despite design and computational challenges, we identify opportunities for automated DSM generation, with all code publicly available for reproducibility and further feedback from the domain experts.

70. 【2602.16053】Multi-Objective Alignment of Language Models for Personalized Psychotherapy

链接https://arxiv.org/abs/2602.16053

作者:Mehrab Beikzadeh,Yasaman Asadollah Salmanpour,Ashima Suvarna,Sriram Sankararaman,Matteo Malgaroli,Majid Sarrafzadeh,Saadia Gabriel

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:billion people worldwide, care remains limited, health disorders affect, Mental health disorders, billion people

备注

点击查看摘要

Abstract:Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria -- empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy -- and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.

71. 【2602.16755】PREFER: An Ontology for the PREcision FERmentation Community

链接https://arxiv.org/abs/2602.16755

作者:Txell Amigó(1),Shawn Zheng Kai Tan(2),Angel Luu Phanthanourak(1),Sebastian Schulz(1),Pasquale D. Colaianni(1),Dominik M. Maszczyk(1),Ester Milesi(1),Ivan Schlembach(1),Mykhaylo Semenov Petrov(1),Marta Reventós Montané(1),Lars K. Nielsen(1,3),Jochen Förster(1),Bernhard Ø. Palsson(1,4),Suresh Sudarsan(1, 5),Alberto Santos(1) ((1) The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kgs. Lyngby, Denmark, (2) SignaMind, Singapore, (3) Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, Queensland, Australia, (4) The Department of Bioengineering, University of California, San Diego, USA, (5) Nexxar ApS, Lynge, Denmark)

类目:Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:produce sustainable food, microbial cell factories, sustainable food, Precision fermentation relies, relies on microbial

备注

点击查看摘要

Abstract:Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels. Specialized laboratories such as biofoundries are advancing these processes using high-throughput bioreactor platforms, which generate vast datasets. However, the lack of community standards limits data accessibility and interoperability, preventing integration across platforms. In order to address this, we introduce PREFER, an open-source ontology designed to establish a unified standard for bioprocess data. Built in alignment with the widely adopted Basic Formal Ontology (BFO) and connecting with several other community ontologies, PREFER ensures consistency and cross-domain compatibility and covers the whole precision fermentation process. Integrating PREFER into high-throughput bioprocess development workflows enables structured metadata that supports automated cross-platform execution and high-fidelity data capture. Furthermore, PREFER's standardization has the potential to bridge disparate data silos, generating machine-actionable datasets critical for training predictive, robust machine learning models in synthetic biology. This work provides the foundation for scalable, interoperable bioprocess systems and supports the transition toward more data-driven bioproduction.

信息检索

1. 【2602.17663】CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

链接https://arxiv.org/abs/2602.17663

作者:Juri Opitz,Corina Raclé,Emanuela Boros,Andrianos Michail,Matteo Romanello,Maud Ehrmann,Simon Clematide

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:multilingual historical texts, CLEF evaluation lab, person-place relation extraction, dedicated to person-place, CLEF evaluation

备注: ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at [this https URL](https://hipe-eval.github.io/HIPE-2026/)

点击查看摘要

Abstract:HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person--place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?") - requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.

2. 【2602.17654】Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval

链接https://arxiv.org/abs/2602.17654

作者:Jiaqi Xi,Raghav Saboo,Luming Chen,Martin Wang,Sudeep Das

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:enhance multi-category e-commerce, multi-category e-commerce search, semantic text embeddings, e-commerce search demands, scale e-commerce search

备注

点击查看摘要

Abstract:We propose a two-stage "Mine and Refine" contrastive training framework for semantic text embeddings to enhance multi-category e-commerce search retrieval. Large scale e-commerce search demands embeddings that generalize to long tail, noisy queries while adhering to scalable supervision compatible with product and policy constraints. A practical challenge is that relevance is often graded: users accept substitutes or complements beyond exact matches, and production systems benefit from clear separation of similarity scores across these relevance strata for stable hybrid blending and thresholding. To obtain scalable policy consistent supervision, we fine-tune a lightweight LLM on human annotations under a three-level relevance guideline and further reduce residual noise via engagement driven auditing. In Stage 1, we train a multilingual Siamese two-tower retriever with a label aware supervised contrastive objective that shapes a robust global semantic space. In Stage 2, we mine hard samples via ANN and re-annotate them with the policy aligned LLM, and introduce a multi-class extension of circle loss that explicitly sharpens similarity boundaries between relevance levels, to further refine and enrich the embedding space. Robustness is additionally improved through additive spelling augmentation and synthetic query generation. Extensive offline evaluations and production A/B tests show that our framework improves retrieval relevance and delivers statistically significant gains in engagement and business impact.

3. 【2602.17544】Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

链接https://arxiv.org/abs/2602.17544

作者:Shashank Aggarwal,Ram Vikas Mishra,Amit Awekar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:LLM-based agents exchange, agents exchange intermediate, search and ranking, LLM-based agents, exchange intermediate reasoning

备注

点击查看摘要

Abstract:In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.

4. 【2602.17518】A Picture of Agentic Search

链接https://arxiv.org/abs/2602.17518

作者:Francesca Pezzuti,Ophir Frieder,Fabrizio Silvestri,Sean MacAvaney,Nicola Tonellotto

类目:Information Retrieval (cs.IR)

关键词:automated systems increasingly, systems increasingly issuing, increasingly issuing search, Information Retrieval, queries alongside humans

备注: 7 pages, 2 figures

点击查看摘要

Abstract:With automated systems increasingly issuing search queries alongside humans, Information Retrieval (IR) faces a major shift. Yet IR remains human-centred, with systems, evaluation metrics, user models, and datasets designed around human queries and behaviours. Consequently, IR operates under assumptions that no longer hold in practice, with changes to workload volumes, predictability, and querying behaviours. This misalignment affects system performance and optimisation: caching may lose effectiveness, query pre-processing may add overhead without improving results, and standard metrics may mismeasure satisfaction. Without adaptation, retrieval models risk satisfying neither humans, nor the emerging user segment of agents. However, datasets capturing agent search behaviour are lacking, which is a critical gap given IR's historical reliance on data-driven evaluation and optimisation. We develop a methodology for collecting all the data produced and consumed by agentic retrieval-augmented systems when answering queries, and we release the Agentic Search Queryset (ASQ) dataset. ASQ contains reasoning-induced queries, retrieved documents, and thoughts for queries in HotpotQA, Researchy Questions, and MS MARCO, for 3 diverse agents and 2 retrieval pipelines. The accompanying toolkit enables ASQ to be extended to new agents, retrievers, and datasets.

5. 【2602.17450】Beyond Pipelines: A Fundamental Study on the Rise of Generative-Retrieval Architectures in Web Research

链接https://arxiv.org/abs/2602.17450

作者:Amirereza Abbasi,Mohsen Hooshmand

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:offering users diverse, significantly over time, offering users, practices have evolved, evolved significantly

备注

点击查看摘要

Abstract:Web research and practices have evolved significantly over time, offering users diverse and accessible solutions across a wide range of tasks. While advanced concepts such as Web 4.0 have emerged from mature technologies, the introduction of large language models (LLMs) has profoundly influenced both the field and its applications. This wave of LLMs has permeated science and technology so deeply that no area remains untouched. Consequently, LLMs are reshaping web research and development, transforming traditional pipelines into generative solutions for tasks like information retrieval, question answering, recommendation systems, and web analytics. They have also enabled new applications such as web-based summarization and educational tools. This survey explores recent advances in the impact of LLMs-particularly through the use of retrieval-augmented generation (RAG)-on web research and industry. It discusses key developments, open challenges, and future directions for enhancing web solutions with LLMs.

6. 【2602.17442】WarpRec: Unifying Academic Rigor and Industrial Scale for Responsible, Reproducible, and Efficient Recommendation

链接https://arxiv.org/abs/2602.17442

作者:Marco Avolio,Potito Aghilar,Sabino Roccotelli,Vito Walter Anelli,Chiara Mallamaci,Vincenzo Paparella,Marco Valentini,Alejandro Bellogín,Michelantonio Trizio,Joseph Trotta,Antonio Ferrara,Tommaso Di Noia

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:complex rewriting required, Recommender Systems, complex rewriting, distributed industrial engines, researchers must choose

备注

点击查看摘要

Abstract:Innovation in Recommender Systems is currently impeded by a fractured ecosystem, where researchers must choose between the ease of in-memory experimentation and the costly, complex rewriting required for distributed industrial engines. To bridge this gap, we present WarpRec, a high-performance framework that eliminates this trade-off through a novel, backend-agnostic architecture. It includes 50+ state-of-the-art algorithms, 40 metrics, and 19 filtering and splitting strategies that seamlessly transition from local execution to distributed training and optimization. The framework enforces ecological responsibility by integrating CodeCarbon for real-time energy tracking, showing that scalability need not come at the cost of scientific integrity or sustainability. Furthermore, WarpRec anticipates the shift toward Agentic AI, leading Recommender Systems to evolve from static ranking engines into interactive tools within the Generative AI ecosystem. In summary, WarpRec not only bridges the gap between academia and industry but also can serve as the architectural backbone for the next generation of sustainable, agent-ready Recommender Systems. Code is available at this https URL

7. 【2602.17410】Improving LLM-based Recommendation with Self-Hard Negatives from Intermediate Layers

链接https://arxiv.org/abs/2602.17410

作者:Bingqian Li,Bowen Zheng,Xiaolei Wang,Long Zhang,Jinpeng Wang,Sheng Chen,Wayne Xin Zhao,Ji-rong Wen

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:shown great promise, Large language models, shown great, great promise, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) have shown great promise in recommender systems, where supervised fine-tuning (SFT) is commonly used for adaptation. Subsequent studies further introduce preference learning to incorporate negative samples into the training process. However, existing methods rely on sequence-level, offline-generated negatives, making them less discriminative and informative when adapting LLMs to recommendation tasks with large negative item spaces. To address these challenges, we propose ILRec, a novel preference fine-tuning framework for LLM-based recommendation, leveraging self-hard negative signals extracted from intermediate layers to improve preference learning. Specifically, we identify self-hard negative tokens from intermediate layers as fine-grained negative supervision that dynamically reflects the model's preference learning process. To effectively integrate these signals into training, we design a two-stage framework comprising cross-layer preference optimization and cross-layer preference distillation, enabling the model to jointly discriminate informative negatives and enhance the quality of negative signals from intermediate layers. In addition, we introduce a lightweight collaborative filtering model to assign token-level rewards for negative signals, mitigating the risk of over-penalizing false negatives. Extensive experiments on three datasets demonstrate ILRec's effectiveness in enhancing the performance of LLM-based recommender systems.

8. 【2602.17386】Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval

链接https://arxiv.org/abs/2602.17386

作者:Adrià Molina,Oriol Ramos Terrades,Josep Lladós

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:modern digital industry, Information retrieval lies, digital industry, modern digital, Information retrieval

备注: Submitted for ICPR Review

点击查看摘要

Abstract:Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.

9. 【2602.17354】raining-free Graph-based Imputation of Missing Modalities in Multimodal Recommendation

链接https://arxiv.org/abs/2602.17354

作者:Daniele Malitesta,Emanuele Rossi,Claudio Pomo,Tommaso Di Noia,Fragkiskos D. Malliaros

类目:Information Retrieval (cs.IR)

关键词:Multimodal recommender systems, missing modalities, Multimodal, missing, recommender systems

备注: Accepted in IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)

点击查看摘要

Abstract:Multimodal recommender systems (RSs) represent items in the catalog through multimodal data (e.g., product images and descriptions) that, in some cases, might be noisy or (even worse) missing. In those scenarios, the common practice is to drop items with missing modalities and train the multimodal RSs on a subsample of the original dataset. To date, the problem of missing modalities in multimodal recommendation has still received limited attention in the literature, lacking a precise formalisation as done with missing information in traditional machine learning. In this work, we first provide a problem formalisation for missing modalities in multimodal recommendation. Second, by leveraging the user-item graph structure, we re-cast the problem of missing multimodal information as a problem of graph features interpolation on the item-item co-purchase graph. On this basis, we propose four training-free approaches that propagate the available multimodal features throughout the item-item graph to impute the missing features. Extensive experiments on popular multimodal recommendation datasets demonstrate that our solutions can be seamlessly plugged into any existing multimodal RS and benchmarking framework while still preserving (or even widen) the performance gap between multimodal and traditional RSs. Moreover, we show that our graph-based techniques can perform better than traditional imputations in machine learning under different missing modalities settings. Finally, we analyse (for the first time in multimodal RSs) how feature homophily calculated on the item-item graph can influence our graph-based imputations.

10. 【2602.17327】WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

链接https://arxiv.org/abs/2602.17327

作者:Michael Dinzinger,Laura Caspari,Ali Salman,Irvin Topi,Jelena Mitrović,Michael Granitzer

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:million FAQ-based natural, FAQ-based natural question-answer, natural question-answer pairs, natural question-answer, million FAQ-based

备注

点击查看摘要

Abstract:We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss, and Knowledge Distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort. Since late 2025, structured FAQs are being regularly released through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset itself and all related resources are publicly available on GitHub and HuggingFace.

11. 【2602.17264】On the Reliability of User-Centric Evaluation of Conversational Recommender Systems

链接https://arxiv.org/abs/2602.17264

作者:Michael Müller,Amir Reza Mohammadi,Andreas Peintner,Beatriz Barroso Gstrein,Günther Specht,Eva Zangerle

类目:Information Retrieval (cs.IR)

关键词:Conversational Recommender Systems, assessing Conversational Recommender, Recommender Systems, Conversational Recommender, capture subjective qualities

备注: 5 pages, 2 figures. Submitted to UMAP 2026. Code available at [this https URL](https://github.com/michael-mue/reliable-crs-eval)

点击查看摘要

Abstract:User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable scalable evaluation, recent work increasingly relies on third-party annotations of static dialogue logs by crowd workers or large language models. However, the reliability of this practice remains largely unexamined. In this paper, we present a large-scale empirical study investigating the reliability and structure of user-centric CRS evaluation on static dialogue transcripts. We collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Using random-effects reliability models and correlation analysis, we quantify the stability of individual dimensions and their interdependencies. Our results show that utilitarian and outcome-oriented dimensions such as accuracy, usefulness, and satisfaction achieve moderate reliability under aggregation, whereas socially grounded constructs such as humanness and rapport are substantially less reliable. Furthermore, many dimensions collapse into a single global quality signal, revealing a strong halo effect in third-party judgments. These findings challenge the validity of single-annotator and LLM-based evaluation protocols and motivate the need for multi-rater aggregation and dimension reduction in offline CRS evaluation.

12. 【2602.17170】When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

链接https://arxiv.org/abs/2602.17170

作者:Chuting Yu,Hang Li,Joel Mackenzie,Teerapong Leelanupab

类目:Information Retrieval (cs.IR)

关键词:Information Retrieval evaluation, Information Retrieval, cognitively intensive, limiting the scalability, Retrieval evaluation

备注

点击查看摘要

Abstract:Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores -- often with high confidence -- to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.

13. 【2602.17099】Multiple Index Merge for Approximate Nearest Neighbor Search

链接https://arxiv.org/abs/2602.17099

作者:Liuchang Jing,Mingyu Yang,Lei Li,Jianbin Qin,Wei Wang

类目:Databases (cs.DB); Information Retrieval (cs.IR)

关键词:Proximity Graph-based indexes, widespread applications, Proximity Graph-based, foundational problem, databases with widespread

备注: technical report

点击查看摘要

Abstract:Approximate $k$ nearest neighbor (AKNN) search in high-dimensional space is a foundational problem in vector databases with widespread applications. Among the numerous AKNN indexes, Proximity Graph-based indexes achieve state-of-the-art search efficiency across various benchmarks. However, their extensive distance computations of high-dimensional vectors lead to slow construction and substantial memory overhead. The limited memory capacity often prevents building the entire index at once when handling large-scale datasets. A common practice is to build multiple sub-indexes separately. However, directly searching on these separated indexes severely compromises search efficiency, as queries cannot leverage cross-graph connections. Therefore, efficient graph index merging is crucial for multi-index searching. In this paper, we focus on efficient two-index merging and the merge order of multiple indexes for AKNN search. To achieve this, we propose a reverse neighbor sliding merge (RNSM) that exploits structural information to boost merging efficiency. We further investigate merge order selection (MOS) to reduce the merging cost by eliminating redundant merge operations. Experiments show that our approach yields up to a 5.48$\times$ speedup over existing index merge methods and 9.92$\times$ speedup over index reconstruction, while maintaining expected superior search performance. Moreover, our method scales efficiently to 100 million vectors with 50 partitions, maintaining consistent speedups.

14. 【2602.17058】A Long-term Value Prediction Framework In Video Ranking

链接https://arxiv.org/abs/2602.17058

作者:Huabin Chen,Xinao Wang,Huiping Chu,Keqin Xu,Chenhao Zhai,Chenyi Wang,Kai Meng,Yuning Jiang

类目:Information Retrieval (cs.IR)

关键词:recommendation remains challenging, Accurately modeling long-term, short-video recommendation remains, remains challenging, Accurately modeling

备注: 9 pages

点击查看摘要

Abstract:Accurately modeling long-term value (LTV) at the ranking stage of short-video recommendation remains challenging. While delayed feedback and extended engagement have been explored, fine-grained attribution and robust position normalization at billion-scale are still underdeveloped. We propose a practical ranking-stage LTV framework addressing three challenges: position bias, attribution ambiguity, and temporal limitations. (1) Position bias: We introduce a Position-aware Debias Quantile (PDQ) module that normalizes engagement via quantile-based distributions, enabling position-robust LTV estimation without architectural changes. (2) Attribution ambiguity: We propose a multi-dimensional attribution module that learns continuous attribution strengths across contextual, behavioral, and content signals, replacing static rules to capture nuanced inter-video influence. A customized hybrid loss with explicit noise filtering improves causal clarity. (3) Temporal limitations: We present a cross-temporal author modeling module that builds censoring-aware, day-level LTV targets to capture creator-driven re-engagement over longer horizons; the design is extensible to other dimensions (e.g., topics, styles). Offline studies and online A/B tests show significant improvements in LTV metrics and stable trade-offs with short-term objectives. Implemented as task augmentation within an existing ranking model, the framework supports efficient training and serving, and has been deployed at billion-scale in Taobao's production system, delivering sustained engagement gains while remaining compatible with industrial constraints.

Comments:
9 pages

Subjects:

Information Retrieval (cs.IR)

Reportnumber:
ind0066

Cite as:
arXiv:2602.17058 [cs.IR]

(or
arXiv:2602.17058v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.17058

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Related DOI:

https://doi.org/10.1145/3774904.3792830

Focus to learn more

            DOI(s) linking to related resources</p>
15. 【2602.17036】LiveGraph: Active-Structure Neural Re-ranking for Exercise Recommendation

链接https://arxiv.org/abs/2602.17036

作者:Rong Fu,Zijian Zhang,Haiyun Wei,Jiekai Wu,Kun Liu,Xianda Li,Haoyu Zhao,Yang Li,Yongtai Liu,Ziming Wang,Rui Lu,Simon Fong

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:intelligent systems capable, providing personalized educational, digital learning environments, personalized educational content, continuous expansion

备注: 19 pages, 5 figures

点击查看摘要

Abstract:The continuous expansion of digital learning environments has catalyzed the demand for intelligent systems capable of providing personalized educational content. While current exercise recommendation frameworks have made significant strides, they frequently encounter obstacles regarding the long-tailed distribution of student engagement and the failure to adapt to idiosyncratic learning trajectories. We present LiveGraph, a novel active-structure neural re-ranking framework designed to overcome these limitations. Our approach utilizes a graph-based representation enhancement strategy to bridge the information gap between active and inactive students while integrating a dynamic re-ranking mechanism to foster content diversity. By prioritizing the structural relationships within learning histories, the proposed model effectively balances recommendation precision with pedagogical variety. Comprehensive experimental evaluations conducted on multiple real-world datasets demonstrate that LiveGraph surpasses contemporary baselines in both predictive accuracy and the breadth of exercise diversity.

16. 【2602.16989】WSDM Cup 2026 Multilingual Retrieval: A Low-Cost Multi-Stage Retrieval Pipeline

链接https://arxiv.org/abs/2602.16989

作者:Chentong Hao,Minmao Wang

类目:Information Retrieval (cs.IR)

关键词:retrieve relevant documents, approximately ten million, multilingual retrieval task, WSDM Cup, articles in Chinese

备注

点击查看摘要

Abstract:We present a low-cost retrieval system for the WSDM Cup 2026 multilingual retrieval task, where English queries are used to retrieve relevant documents from a collection of approximately ten million news articles in Chinese, Persian, and Russian, and to output the top-1000 ranked results for each query. We follow a four-stage pipeline that combines LLM-based GRF-style query expansion with BM25 candidate retrieval, dense ranking using long-text representations from jina-embeddings-v4, and pointwise re-ranking of the top-20 candidates using Qwen3-Reranker-4B while preserving the dense order for the remaining results. On the official evaluation, the system achieves nDCG@20 of 0.403 and Judged@20 of 0.95. We further conduct extensive ablation experiments to quantify the contribution of each stage and to analyze the effectiveness of query expansion, dense ranking, and top-$k$ reranking under limited compute budgets.

17. 【2602.16986】Bending the Scaling Law Curve in Large-Scale Recommendation Systems

链接https://arxiv.org/abs/2602.16986

作者:Qin Ding,Kevin Course,Linjian Ma,Jianhui Sun,Rouchen Liu,Zhao Zhu,Chunxing Yin,Wei Li,Dai Li,Yu Shi,Xuan Cao,Ze Yang,Han Li,Xing Liu,Bi Xue,Hongwei Li,Rui Jian,Daisy Shi He,Jing Qian,Matt Ma,Qunshu Zhang,Rui Li

类目:Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

关键词:large-scale recommender systems, user interaction history, interaction history, cornerstone of large-scale, large-scale recommender

备注

点击查看摘要

Abstract:Learning from user interaction history through sequential models has become a cornerstone of large-scale recommender systems. Recent advances in large language models have revealed promising scaling laws, sparking a surge of research into long-sequence modeling and deeper architectures for recommendation tasks. However, many recent approaches rely heavily on cross-attention mechanisms to address the quadratic computational bottleneck in sequential modeling, which can limit the representational power gained from self-attention. We present ULTRA-HSTU, a novel sequential recommendation model developed through end-to-end model and system co-design. By innovating in the design of input sequences, sparse attention mechanisms, and model topology, ULTRA-HSTU achieves substantial improvements in both model quality and efficiency. Comprehensive benchmarking demonstrates that ULTRA-HSTU achieves remarkable scaling efficiency gains -- over 5x faster training scaling and 21x faster inference scaling compared to conventional models -- while delivering superior recommendation quality. Our solution is fully deployed at scale, serving billions of users daily and driving significant 4% to 8% consumption and engagement improvements in real-world production environments.

18. 【2602.16974】Beyond Chunk-Then-Embed: A Comprehensive Taxonomy and Evaluation of Document Chunking Strategies for Information Retrieval

链接https://arxiv.org/abs/2602.16974

作者:Yongjie Zhou,Shuai Wang,Bevan Koopman,Guido Zuccon

类目:Information Retrieval (cs.IR)

关键词:remains poorly understood, critical preprocessing step, dense retrieval systems, strategies remains poorly, poorly understood

备注: Github link will be pushed later as it's anonymoused at the moment

点击查看摘要

Abstract:Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies(e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult. This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically-informed and LLM-guided methods; and (2) embedding paradigms, which determine the timing of chunking relative to embedding (pre-embedding chunking vs. contextualized chunking). Our reproduction evaluates these approaches in two distinct retrieval settings established in previous work: in-document retrieval (needle-in-a-haystack) and in-corpus retrieval (the standard information retrieval task). Our comprehensive evaluation reveals that optimal chunking strategies are task-dependent: simple structure-based methods outperform LLM-guided alternatives for in-corpus retrieval, while LumberChunker performs best for in-document retrieval. Contextualized chunking improves in-corpus effectiveness but degrades in-document retrieval. We also find that chunk size correlates moderately with in-document but weakly with in-corpus effectiveness, suggesting segmentation method differences are not purely driven by chunk size. Our code and evaluation benchmarks are publicly available at (Anonymoused).

Comments:
Github link will be pushed later as it’s anonymoused at the moment

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2602.16974 [cs.IR]

(or
arXiv:2602.16974v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.16974

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
19. 【2602.16964】SAGE: Structure Aware Graph Expansion for Retrieval of Heterogeneous Data

链接https://arxiv.org/abs/2602.16964

作者:Prasham Titiya,Rohit Khoja,Tomer Wolfson,Vivek Gupta,Dan Roth

类目:Information Retrieval (cs.IR)

关键词:Retrieval-augmented question answering, Retrieval-augmented question, requires connected evidence, heterogeneous corpora requires, corpora requires connected

备注

点击查看摘要

Abstract:Retrieval-augmented question answering over heterogeneous corpora requires connected evidence across text, tables, and graph nodes. While entity-level knowledge graphs support structured access, they are costly to construct and maintain, and inefficient to traverse at query time. In contrast, standard retriever-reader pipelines use flat similarity search over independently chunked text, missing multi-hop evidence chains across modalities. We propose SAGE (Structure Aware Graph Expansion) framework that (i) constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning, and (ii) performs online retrieval by running an initial baseline retriever to obtain k seed chunks, expanding first-hop neighbors, and then filtering the neighbors using dense+sparse retrieval, selecting k' additional chunks. We instantiate the initial retriever using hybrid dense+sparse retrieval for implicit cross-modal corpora and SPARK (Structure Aware Planning Agent for Retrieval over Knowledge Graphs) an agentic retriever for explicit schema graphs. On OTT-QA and STaRK, SAGE improves retrieval recall by 5.7 and 8.5 points over baselines.

20. 【2602.16932】RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution

链接https://arxiv.org/abs/2602.16932

作者:Jinming Nian,Fangchen Li,Dae Hoon Park,Yi Fang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:efficient first-stage rankers, smoothing remain strong, Dirichlet smoothing remain, first-stage rankers, human intuition

备注

点击查看摘要

Abstract:Retrieval algorithms like BM25 and query likelihood with Dirichlet smoothing remain strong and efficient first-stage rankers, yet improvements have mostly relied on parameter tuning and human intuition. We investigate whether a large language model, guided by an evaluator and evolutionary search, can automatically discover improved lexical retrieval algorithms. We introduce RankEvolve, a program evolution setup based on AlphaEvolve, in which candidate ranking algorithms are represented as executable code and iteratively mutated, recombined, and selected based on retrieval performance across 12 IR datasets from BEIR and BRIGHT. RankEvolve starts from two seed programs: BM25 and query likelihood with Dirichlet smoothing. The evolved algorithms are novel, effective, and show promising transfer to the full BEIR and BRIGHT benchmarks as well as TREC DL 19 and 20. Our results suggest that evaluator-guided LLM program evolution is a practical path towards automatic discovery of novel ranking algorithms.

计算机视觉

1. 【2602.17665】OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

链接https://arxiv.org/abs/2602.17665

作者:Akashah Shabbir,Muhammad Umer Sheikh,Muhammad Akhtar Munir,Hiyam Debary,Mustansar Fiaz,Muhammad Zaigham Zaheer,Paolo Fraccaro,Fahad Shahbaz Khan,Muhammad Haris Khan,Xiao Xiang Zhu,Salman Khan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:structured analytical tasks, progress in multimodal, perform structured analytical, interpret imagery, reasoning

备注

点击查看摘要

Abstract:Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.

2. 【2602.17659】When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

链接https://arxiv.org/abs/2602.17659

作者:Yu Fang,Yuchun Feng,Dong Jing,Jiaqi Liu,Yue Yang,Zhenyu Wei,Daniel Szafir,Mingyu Ding

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:faithfully follow language, promise to ground, robot control, practice often fail, fail to faithfully

备注: Website: [this https URL](https://vla-va.github.io/)

点击查看摘要

Abstract:Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures of 9.4% and improves task success by 17.2% on average.

3. 【2602.17650】Human-level 3D shape perception emerges from multi-view learning

链接https://arxiv.org/abs/2602.17650

作者:Tyler Bonnen,Jitendra Malik,Angjoo Kanazawa

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:two-dimensional visual inputs, infer the three-dimensional, three-dimensional structure, human, visual inputs

备注

点击查看摘要

Abstract:Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these `multi-view' models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.

4. 【2602.17645】Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

链接https://arxiv.org/abs/2602.17645

作者:Xiaohan Zhao,Zhaoyi Li,Yaxin Luo,Jiacheng Cui,Zhiqiang Shen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, complex multimodal boundaries, Vision-Language Models, Large Vision-Language, multimodal boundaries

备注: Code at: [this https URL](https://github.com/vila-lab/M-Attack-V2)

点击查看摘要

Abstract:Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: this https URL.

5. 【2602.17639】IntRec: Intent-based Retrieval with Contrastive Refinement

链接https://arxiv.org/abs/2602.17639

作者:Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,Yue Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Retrieving user-specified objects, involve multiple similar, Retrieving user-specified, complex scenes remains, multiple similar objects

备注

点击查看摘要

Abstract:Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.

6. 【2602.17636】CORAL: Correspondence Alignment for Improved Virtual Try-On

链接https://arxiv.org/abs/2602.17636

作者:Jiyoung Kim,Youngjin Shin,Siyoon Jin,Dahyun Chung,Jisu Nam,Tongmin Kim,Jongjae Park,Hyeonwoo Kang,Seungryong Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:preserve fine garment, Virtual Try-On, Existing methods, fine garment details, Diffusion Transformers

备注: 32 pages, 25 figures

点击查看摘要

Abstract:Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.

7. 【2602.17605】Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

链接https://arxiv.org/abs/2602.17605

作者:Jowaria Khan,Anindya Sarkar,Yevgeniy Vorobeychik,Elizabeth Bondi-Kelly

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:tight resource constraints, difficult data collection, efficiently uncovering hidden, disaster response, environmental monitoring

备注

点击查看摘要

Abstract:In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.

8. 【2602.17599】Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

链接https://arxiv.org/abs/2602.17599

作者:Ivan Rinaldi,Matteo Mendula,Nicola Fanelli,Florence Levé,Matteo Testi,Giovanna Castellano,Gennaro Vessio

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

关键词:multimodal deep learning, Free Music Archive, advanced markedly, synthesize audio, audio from text

备注

点击查看摘要

Abstract:Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems-as expected given the substantially increased difficulty of removing linguistic supervision-ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.

9. 【2602.17573】FR-GESTURE: An RGBD Dataset For Gesture-based Human-Robot Interaction In First Responder Operations

链接https://arxiv.org/abs/2602.17573

作者:Konstantinos Foteinos,Georgios Angelidis,Aggelos Psiris,Vasileios Argyriou,Panagiotis Sarigiannidis,Georgios Th. Papadopoulos

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:increasing intensity, intensity and number, number of disasters, disasters make, difficult the work

备注

点击查看摘要

Abstract:The ever increasing intensity and number of disasters make even more difficult the work of First Responders (FRs). Artificial intelligence and robotics solutions could facilitate their operations, compensating these difficulties. To this end, we propose a dataset for gesture-based UGV control by FRs, introducing a set of 12 commands, drawing inspiration from existing gestures used by FRs and tactical hand signals and refined after incorporating feedback from experienced FRs. Then we proceed with the data collection itself, resulting in 3312 RGBD pairs captured from 2 viewpoints and 7 distances. To the best of our knowledge, this is the first dataset especially intended for gesture-based UGV guidance by FRs. Finally we define evaluation protocols for our RGBD dataset, termed FR-GESTURE, and we perform baseline experiments, which are put forward for improvement. We have made data publicly available to promote future research on the domain: this https URL.

10. 【2602.17558】RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

链接https://arxiv.org/abs/2602.17558

作者:Qiucheng Wu,Jing Shi,Simon Jenni,Kushal Kafle,Tianyu Wang,Shiyu Chang,Handong Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, large language models, multimodal large language, extending vision-language reasoning, shown great potential

备注: 10 pages, 6 figures

点击查看摘要

Abstract:Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

11. 【2602.17555】GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

链接https://arxiv.org/abs/2602.17555

作者:Zixu Cheng,Da Li,Jian Hu,Ziquan Liu,Wei Li,Shaogang Gong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video reasoning, Video, reasoning requires understanding, Video reasoning requires, reasoning

备注: Under review

点击查看摘要

Abstract:Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during the video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.

12. 【2602.17535】LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

链接https://arxiv.org/abs/2602.17535

作者:Behzad Bozorgtabar,Dwarikanath Mahapatra,Sudipta Roy,Muzammal Naseer,Imran Razzak,Zongyuan Ge

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:domain shift hinges, strong zero-shot recognizers, Medical vision-language models, textbf, reliability under domain

备注: 18 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced-high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose \texttt{\textbf{LATA}} (Laplacian-Assisted Transductive Adaptation), a \textit{training- and label-free} refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a \textit{failure-aware} conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. \texttt{\textbf{LATA}} is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across \textbf{three} medical VLMs and \textbf{nine} downstream tasks, \texttt{\textbf{LATA}} consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that \texttt{\textbf{LATA}} sharpens zero-shot predictions without compromising exchangeability.

13. 【2602.17517】FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality

链接https://arxiv.org/abs/2602.17517

作者:Hanyuan Zhang,Lucas He,Runlong He,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangelos B. Mazomenos,Matthew J. Clarkson

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:improve tumor localization, Augmented reality, laparoscopic liver surgery, liver surgery, reality can improve

备注

点击查看摘要

Abstract:Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.

14. 【2602.17484】racing Copied Pixels and Regularizing Patch Affinity in Copy Detection

链接https://arxiv.org/abs/2602.17484

作者:Yichen Lu,Siwei Nie,Minlong Lu,Xudong Yang,Xiaobo Zhang,Peng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Image Copy Detection, Copy Detection, Image Copy, robust feature representation, feature representation learning

备注

点击查看摘要

Abstract:Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace - a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.

15. 【2602.17478】QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery

链接https://arxiv.org/abs/2602.17478

作者:Xuan-Bac Nguyen,Hoang-Quan Nguyen,Sankalp Pandey,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Characterizing two-dimensional quantum, subtle layer-dependent contrast, Characterizing two-dimensional, limited labeled data, optical microscopy images

备注: Project page: [this https URL](https://uark-cviu.github.io/projects/qupaint/)

点击查看摘要

Abstract:Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.

16. 【2602.17473】4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

链接https://arxiv.org/abs/2602.17473

作者:Jiwei Shan,Zeyu Cai,Cheng-Tai Hsieh,Yirui Li,Hao Liu,Lijun Han,Hesheng Wang,Shing Shin Cheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reconstructing deformable surgical, Reconstructing deformable, clinically important, videos is challenging, challenging and clinically

备注: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file Subjects

点击查看摘要

Abstract:Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: this https URL.

17. 【2602.17419】EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

链接https://arxiv.org/abs/2602.17419

作者:Xiaomeng Peng,Xilang Huang,Seon Han Choi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:limited semantic explanations, deep learning approaches, learning approaches produce, provide limited semantic, Industrial anomaly detection

备注

点击查看摘要

Abstract:Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLMs internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at \href{this https URL}{this https URL}

18. 【2602.17397】A High-Level Survey of Optical Remote Sensing

链接https://arxiv.org/abs/2602.17397

作者:Panagiotis Koletsis,Vasilis Efthymiou,Maria Vakalopoulou,Nikos Komodakis,Anastasios Doulamis,Georgios Th. Papadopoulos

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:recent years, significant advances, advances in computer, computer vision, propelled progress

备注

点击查看摘要

Abstract:In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.

19. 【2602.17395】SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery

链接https://arxiv.org/abs/2602.17395

作者:Lorenzo Caselli,Marco Mistretta,Simone Magistri,Andrew D. Bagdanov

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Generalized Category Discovery, Generalized Category, Category Discovery, small labeled subset, aims to identify

备注: Accepted at ICLR 2026. Code available at [this https URL](https://github.com/miccunifi/SpectralGCD)

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: this https URL.

20. 【2602.17387】DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

链接https://arxiv.org/abs/2602.17387

作者:Changhun Kim,Martin Mayr,Thomas Gorges,Fei Wu,Mathias Seuret,Andreas Maier,Vincent Christlein

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:handwritten text recognition, makes decoding slow, handwritten text, text recognition, systems commonly

备注: Submitted to Pattern Recognition, 11 pages + 2-page appendix, 7 figures, 12 tables

点击查看摘要

Abstract:State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.

21. 【2602.17372】ree crop mapping of South America reveals links to deforestation and conservation

链接https://arxiv.org/abs/2602.17372

作者:Yuchang Jiang,Anton Raichuk,Xiaoye Tong,Vivien Sainte Fare Garnot,Daniel Ortiz-Gonzalo,Dan Morris,Konrad Schindler,Jan Dirk Wegner,Maxim Neumann

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:European Union Regulation, Deforestation-free Products, European Union, Union Regulation, Regulation on Deforestation-free

备注

点击查看摘要

Abstract:Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union's Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of highresolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as "forest". This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.

22. 【2602.17353】Application and Evaluation of the Common Circles Method

链接https://arxiv.org/abs/2602.17353

作者:Michael Quellmalz,Mia Kvåle Løvmo,Simon Moser,Franziska Strasser,Monika Ritsch-Marte

类目:Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)

关键词:optical diffraction tomography, sized biological tissue, sub-millimeter sized biological, common circle method, diffraction tomography

备注

点击查看摘要

Abstract:We investigate the application of the common circle method for estimating sample motion in optical diffraction tomography (ODT) of sub-millimeter sized biological tissue. When samples are confined via contact-free acoustical force fields, their motion must be estimated from the captured images. The common circle method identifies intersections of Ewald spheres in Fourier space to determine rotational motion. This paper presents a practical implementation, incorporating temporal consistency constraints to achieve stable reconstructions. Our results on both simulated and real-world data demonstrate that the common circle method provides a computationally efficient alternative to full optimization methods for motion detection.

23. 【2602.17337】Polaffini: A feature-based approach for robust affine and polyaffine image registration

链接https://arxiv.org/abs/2602.17337

作者:Antoine Legouhy,Cosimo Campo,Ross Callaghan,Hojjat Azadbakht,Hui Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:work we present, registration, Polaffini, present Polaffini, versatile framework

备注: associated github repo: [this https URL](https://github.com/CIG-UCL/polaffini)

点击查看摘要

Abstract:In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.

24. 【2602.17322】Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline

链接https://arxiv.org/abs/2602.17322

作者:Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging task due, Detecting tampered text, Detecting tampered, challenging task, task due

备注

点击查看摘要

Abstract:Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model's ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.

25. 【2602.17321】he Sound of Death: Deep Learning Reveals Vascular Damage from Carotid Ultrasound

链接https://arxiv.org/abs/2602.17321

作者:Christoph Balada,Aida Romano-Martinez,Payal Varshney,Vincent ten Cate,Katharina Geschke,Jonas Tesarz,Paul Claßen,Alexander K. Schuster,Dativa Tibyampansha,Karl-Patrik Kresoja,Philipp S. Wild,Sheraz Ahmed,Andreas Dengel

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:early risk detection, remain the leading, Carotid ultrasound, Cardiovascular diseases, mortality worldwide

备注

点击查看摘要

Abstract:Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, yet early risk detection is often limited by available diagnostics. Carotid ultrasound, a non-invasive and widely accessible modality, encodes rich structural and hemodynamic information that is largely untapped. Here, we present a machine learning (ML) framework that extracts clinically meaningful representations of vascular damage (VD) from carotid ultrasound videos, using hypertension as a weak proxy label. The model learns robust features that are biologically plausible, interpretable, and strongly associated with established cardiovascular risk factors, comorbidities, and laboratory measures. High VD stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models such as SCORE2. Explainable AI analyses reveal that the model relies on vessel morphology and perivascular tissue characteristics, uncovering novel functional and anatomical signatures of vascular damage. This work demonstrates that routine carotid ultrasound contains far more prognostic information than previously recognized. Our approach provides a scalable, non-invasive, and cost-effective tool for population-wide cardiovascular risk assessment, enabling earlier and more personalized prevention strategies without reliance on laboratory tests or complex clinical inputs.

26. 【2602.17310】Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery

链接https://arxiv.org/abs/2602.17310

作者:Dennis N. Schneider,Lars Wagner,Daniel Rueckert,Dirk Wilhelm

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate grasping point, minimally invasive surgery, Accurate grasping, key challenge, minimally invasive

备注

点击查看摘要

Abstract:Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.

27. 【2602.17277】Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution

链接https://arxiv.org/abs/2602.17277

作者:Ruoyi Zhang,Jiawei Yuan,Lujia Ye,Runling Yu,Liling Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:High-resolution satellite imagery, High-resolution satellite, Generative Adversarial Network, tracking the genesis, tropical cyclones

备注: Under review

点击查看摘要

Abstract:High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4$\times$ upscaling demonstrate that PESTGAN establishes a better performance in structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.

28. 【2602.17270】Unified Latents (UL): How to train your latents

链接https://arxiv.org/abs/2602.17270

作者:Jonathan Heek,Emiel Hoogeboom,Thomas Mensink,Tim Salimans

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:present Unified Latents, learning latent representations, present Unified, Unified Latents, Stable Diffusion latents

备注

点击查看摘要

Abstract:We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves competitive FID of 1.4, with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.

29. 【2602.17260】EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

链接https://arxiv.org/abs/2602.17260

作者:Hung Mai,Loi Dinh,Duc Hai Nguyen,Dat Do,Luong Doan,Khanh Nguyen Quoc,Huan Vu,Phong Ho,Naeem Ul Islam,Tuan Do

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computationally heavy MLLMs, produced highly realistic, highly realistic synthetic, shallow embedding trajectories, Recent advances

备注: First preprint

点击查看摘要

Abstract:Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.

30. 【2602.17252】A Multi-modal Detection System for Infrastructure-based Freight Signal Priority

链接https://arxiv.org/abs/2602.17252

作者:Ziyan Zhang,Chuheng Wei,Xuanpeng Zhao,Siyan Li,Will Snyder,Mike Stas,Peng Hao,Kanok Boriboonsomsin,Guoyuan Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)

关键词:Freight Signal Priority, approaching signalized intersections, signalized intersections require, intersections require reliable, infrastructure-based Freight Signal

备注: 12 pages, 15 figures. Accepted at ICTD 2026. Final version to appear in ASCE Proceedings

点击查看摘要

Abstract:Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.

31. 【2602.17250】Inferring Height from Earth Embeddings: First insights using Google AlphaEarth

链接https://arxiv.org/abs/2602.17250

作者:Alireza Hamoudzadeh,Valeria Belloni,Roberta Ravanelli

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal features encoded, Digital Surface Model, Earth Embeddings, guide deep learning, effectively guide deep

备注: 29 pages, 9 figures

点击查看摘要

Abstract:This study investigates whether the geospatial and multimodal features encoded in \textit{Earth Embeddings} can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with $R^2 = 0.97$), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization ($R^2 = 0.84$, median difference = -2.62 m) compared with the standard U-Net ($R^2 = 0.78$, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns. Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.

32. 【2602.17231】HiMAP: History-aware Map-occupancy Prediction with Fallback

链接https://arxiv.org/abs/2602.17231

作者:Yiming Xu,Yi Yang,Hao Cheng,Monika Sester

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate motion forecasting, Accurate motion, autonomous driving, assuming that objects, continuously tracked

备注: Accepted in 2026 IEEE International Conference on Robotics and Automation

点击查看摘要

Abstract:Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present \textbf{HiMAP}, a tracking-free, trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse~2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11\% in FDE, 12\% in ADE, and a 4\% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy. The code is available under: this https URL.

33. 【2602.17200】GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

链接https://arxiv.org/abs/2602.17200

作者:Ye Zhu,Kaleb S. Newman,Johannes F. Lutzeyer,Adriana Romero-Soriano,Michal Drozdzal,Olga Russakovsky

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:synthesize diverse images, generative models, models still struggle, struggle to synthesize, synthesize diverse

备注: Preprint. Code will be available at [this https URL](https://github.com/L-YeZhu/GASS_T2I)

点击查看摘要

Abstract:Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.

34. 【2602.17196】EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

链接https://arxiv.org/abs/2602.17196

作者:Yahong Wang,Juncheng Wu,Zhangkai Ni,Chengmei Yang,Yihang Liu,Longzhen Yang,Yuyin Zhou,Ying Wen,Lianghua He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:incur substantial inference, substantial inference cost, inference cost due, large language models, Multimodal large language

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at this https URL.

35. 【2602.17189】xo: Formula Recognition within 20M Parameters

链接https://arxiv.org/abs/2602.17189

作者:Sicheng Mao

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:highperformance formula recognition, formula recognition model, million parameters, present Texo, paper we present

备注

点击查看摘要

Abstract:In this paper we present Texo, a minimalist yet highperformance formula recognition model that contains only 20 million parameters. By attentive design, distillation and transfer of the vocabulary and the tokenizer, Texo achieves comparable performance to state-of-the-art models such as UniMERNet-T and PPFormulaNet-S, while reducing the model size by 80% and 65%, respectively. This enables real-time inference on consumer-grade hardware and even in-browser deployment. We also developed a web application to demonstrate the model capabilities and facilitate its usage for end users.

36. 【2602.17186】Selective Training for Large Vision Language Models via Visual Information Gain

链接https://arxiv.org/abs/2602.17186

作者:Seulbi Lee,Sangheum Hwang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision Language, Vision Language Models, Large Vision, achieved remarkable progress, Language Models

备注

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.

37. 【2602.17182】NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting

链接https://arxiv.org/abs/2602.17182

作者:Jiwei Shan,Zeyu Cai,Yirui Li,Yongbo Chen,Lijun Han,Yun-hui Liu,Hesheng Wang,Shing Shin Cheng

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Visual simultaneous localization, Visual simultaneous, monocular non-rigid SLAM, non-rigid SLAM, perception and navigation

备注

点击查看摘要

Abstract:Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50\% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.

38. 【2602.17168】BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning

链接https://arxiv.org/abs/2602.17168

作者:Siyuan Liang,Yongcheng Jing,Yingjie Wang,Jiaxing Huang,Ee-chien Chang,Dacheng Tao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal contrastive learning, contrastive learning models, learning models faces, multimodal contrastive, contrastive learning

备注: 25 pages, 10 figures

点击查看摘要

Abstract:Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.

39. 【2602.17134】B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

链接https://arxiv.org/abs/2602.17134

作者:Hiromichi Kamata,Samuel Arthur Munro,Fuminori Homma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, game production, essential for real-time, real-time editing, editing of pre-reconstructed

备注: Project page: [this https URL](https://sony.github.io/B3-Seg-project/)

点击查看摘要

Abstract:Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves competitive results to high-cost supervised methods while operating end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.

40. 【2602.17124】3D Scene Rendering with Multimodal Gaussian Splatting

链接https://arxiv.org/abs/2602.17124

作者:Chi-Shiang Gau,Konstantinos D. Polyzos,Athanasios Bacharis,Saketh Madhuvarasu,Tara Javidi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:spanning industrial monitoring, applications spanning industrial, computer vision, industrial monitoring, autonomous driving

备注

点击查看摘要

Abstract:3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, typically incurring additional processing cost during initialization while falling short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.

41. 【2602.17101】Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success

链接https://arxiv.org/abs/2602.17101

作者:Varun Burde,Pavel Burget,Torsten Sattler

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:foundational layer, layer for numerous, numerous robotic perception, pose, numerous robotic

备注

点击查看摘要

Abstract:3D reconstruction serves as the foundational layer for numerous robotic perception tasks, including 6D object pose estimation and grasp pose generation. Modern 3D reconstruction methods for objects can produce visually and geometrically impressive meshes from multi-view images, yet standard geometric evaluations do not reflect how reconstruction quality influences downstream tasks such as robotic manipulation performance. This paper addresses this gap by introducing a large-scale, physics-based benchmark that evaluates 6D pose estimators and 3D mesh models based on their functional efficacy in grasping. We analyze the impact of model fidelity by generating grasps on various reconstructed 3D meshes and executing them on the ground-truth model, simulating how grasp poses generated with an imperfect model affect interaction with the real object. This assesses the combined impact of pose error, grasp robustness, and geometric inaccuracies from 3D reconstruction. Our results show that reconstruction artifacts significantly decrease the number of grasp pose candidates but have a negligible effect on grasping performance given an accurately estimated pose. Our results also reveal that the relationship between grasp success and pose error is dominated by spatial error, and even a simple translation error provides insight into the success of the grasping pose of symmetric objects. This work provides insight into how perception systems relate to object manipulation using robots.

42. 【2602.17085】ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions

链接https://arxiv.org/abs/2602.17085

作者:Shogo Sato,Kazuo Tanaka,Shojun Ogasawara,Kazuki Yamamoto,Kazuhiko Murasaki,Ryuichi Tanida,Jun Kataoka

类目:Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)

关键词:energetic transient phenomena, Gamma-ray bursts, high-energy astrophysical processes, energetic transient, transient phenomena

备注: Accepted by ApJ

点击查看摘要

Abstract:Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.

43. 【2602.17077】Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection

链接https://arxiv.org/abs/2602.17077

作者:Lee Dayeon,Kim Dongheyong,Park Chaewon,Woo Sungmin,Lee Sangyoun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Weakly supervised video, Weakly supervised, supervised video anomaly, supervised video, aims to detect

备注: ICASSP 2026

点击查看摘要

Abstract:Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.

44. 【2602.17063】Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

链接https://arxiv.org/abs/2602.17063

作者:Akira Sakai,Yuma Ichikawa

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Sub-bit model compression, model compression seeks, compression seeks storage, Sub-bit model, aggressively compressed

备注

点击查看摘要

Abstract:Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.

45. 【2602.17060】Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding

链接https://arxiv.org/abs/2602.17060

作者:Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:persistently occlude laparoscopic, occlude laparoscopic views, attract disproportionate feature, disproportionate feature points, feature points due

备注

点击查看摘要

Abstract:Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.

46. 【2602.17048】StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection

链接https://arxiv.org/abs/2602.17048

作者:Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:facto standard, standard for converting, unsupervised anomaly detection, Max pooling, UAD

备注

点击查看摘要

Abstract:Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor phi(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.17048 [cs.CV]

(or
arXiv:2602.17048v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.17048

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
47. 【2602.17047】Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

链接https://arxiv.org/abs/2602.17047

作者:Chaojie Yang,Tian Li,Yue Zhang,Jun Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion Transformer, prohibitive computational costs, significantly advanced, generation but suffer, deployment barriers

备注

点击查看摘要

Abstract:Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline-from the 10B to the 6B variant-requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.

48. 【2602.17033】PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing

链接https://arxiv.org/abs/2602.17033

作者:Peize Li,Zeyu Zhang,Hao Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:structure remains challenging, learned priors struggle, existing systems provide, systems provide limited, provide limited support

备注

点击查看摘要

Abstract:Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO-reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse-with inference of 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: this https URL. Website: this https URL.

49. 【2602.17030】Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings

链接https://arxiv.org/abs/2602.17030

作者:Eric Chen,Patricia Alves-Oliveira

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:legal contexts, increasingly involved, documenting authorship, creative production, collectors

备注

点击查看摘要

Abstract:As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.

50. 【2602.16979】Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

链接https://arxiv.org/abs/2602.16979

作者:Divyam Madaan,Sumit Chopra,Kyunghyun Cho

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, Large Language, existing approaches predominantly, approaches predominantly assume

备注

点击查看摘要

Abstract:Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.

51. 【2602.16968】DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

链接https://arxiv.org/abs/2602.16968

作者:Dahye Kim,Deepti Ghadiyaram,Raghudeep Gadde

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Diffusion Transformers, heavy computation, Diffusion, Transformers, content complexity

备注

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on this http URL and Wan $2.1$, respectively, without compromising the generation quality and prompt adherence.

52. 【2602.16950】HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs

链接https://arxiv.org/abs/2602.16950

作者:Kibon Ku,Talukder Z. Jubery,Adarsh Krishnamurthy,Baskar Ganapathysubramanian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advancing agricultural sustainability, enabled accurate, plant phenotypes, breeding programs, HSI captures detailed

备注: 16 pages, 14 figures, 3 tables

点击查看摘要

Abstract:Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement. Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.

53. 【2602.16918】Xray-Visual Models: Scaling Vision models on Industry Scale Data

链接https://arxiv.org/abs/2602.16918

作者:Shlok Mishra,Tsung-Yu Lin,Linda Wang,Hongli Xu,Yimin Liu,Michael Hsu,Chaitanya Ahuja,Hao Yuan,Jianpeng Cheng,Hong-You Chen,Haoyuan Xu,Chao Li,Abhijeet Awasthi,Jihye Moon,Don Husa,Michael Ge,Sumedha Singla,Arkabandhu Chowdhury,Phong Dingh,Satya Narayan Shukla,Yonghuan Yang,David Jacobs,Qi Guo,Jun Xiao,Xiangjun Fan,Aashu Singh

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:industry-scale social media, social media data, video understanding trained, unified vision model, trained on industry-scale

备注

点击查看摘要

Abstract:We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.

54. 【2602.16917】SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts

链接https://arxiv.org/abs/2602.16917

作者:Sakib Ahammed,Xia Cui,Xinqi Fan,Wenqi Lu,Moi Hoon Yap

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Modern vision models, models increasingly rely, include descriptive concepts, rich semantic representations, semantic

备注

点击查看摘要

Abstract:Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.

55. 【2602.16915】StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation

链接https://arxiv.org/abs/2602.16915

作者:Zeyu Ren,Xiang Li,Yiran Wang,Zeyu Zhang,Hao Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe domain shifts, domain shifts caused, Stereo depth estimation, underwater robotic perception, wavelength-dependent light attenuation

备注

点击查看摘要

Abstract:Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks with 17% improvement on TartanAir-UW and 7.2% improvment on SQUID, with real-world validation on the BlueROV2 platform demonstrates the robustness of our approach. Code: this https URL. Website: this https URL.

56. 【2602.16898】MALLVI: a multi agent framework for integrated generalized robotics manipulation

链接https://arxiv.org/abs/2602.16898

作者:Iman Ahmadi,Mehrshad Taji,Arad Mahdinezhad Kashani,AmirHossein Jadidi,Saina Kashani,Babak Khalaj

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:large language models, Vision Language Model, http URL, Agent Large Language, large language

备注

点击查看摘要

Abstract:Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic this http URL present MALLVi, a Multi Agent Large Language and Vision framework that enables closed loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next this http URL than using a single model, MALLVi coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full this http URL in simulation and real world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation this http URL available at this https URL.

57. 【2602.16872】DODO: Discrete OCR Diffusion Models

链接https://arxiv.org/abs/2602.16872

作者:Sean Man,Roy Ganz,Roi Ronen,Shahar Tsiper,Shai Mazor,Niv Nayman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Optical Character Recognition, Optical Character, Character Recognition, digitizing information, textual understanding

备注

点击查看摘要

Abstract:Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; those introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.

58. 【2602.16856】Analytic Score Optimization for Multi Dimension Video Quality Assessment

链接https://arxiv.org/abs/2602.16856

作者:Boda Lin,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video Quality Assessment, Quality, Motion Amplitude, multi-faceted evaluations, multi-dimensional VQA

备注: 18 pages

点击查看摘要

Abstract:Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content~(UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.

59. 【2602.16713】hree-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins

链接https://arxiv.org/abs/2602.16713

作者:Shuo Wang,Shuo Wang,Xin Nie,Yasutaka Narazaki,Thomas Matiki,Billie F. Spencer Jr

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advancements, Neural Radiance Field, image-based damage identifications, transcending traditional, precise three-dimensional

备注

点击查看摘要

Abstract:Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF's continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method's key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.

60. 【2602.17557】Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimer's and Lewy Body Dementia Diagnosis

链接https://arxiv.org/abs/2602.17557

作者:Minheng Chen,Jing Zhang,Tong Chen,Chao Cao,Tianming Liu,Li Su,Dajiang Zhu

类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:distinct diagnostic strategies, require distinct diagnostic, Lewy body dementia, Alzheimer disease, present overlapping clinical

备注

点击查看摘要

Abstract:Alzheimer's disease (AD) and Lewy body dementia (LBD) present overlapping clinical features yet require distinct diagnostic strategies. While neuroimaging-based brain network analysis is promising, atlas-based representations may obscure individualized anatomy. Gyral folding-based networks using three-hinge gyri provide a biologically grounded alternative, but inter-individual variability in cortical folding results in inconsistent landmark correspondence and highly irregular network sizes, violating the fixed-topology and node-alignment assumptions of most existing graph learning methods, particularly in clinical datasets where pathological changes further amplify anatomical heterogeneity. We therefore propose a probability-invariant random-walk-based framework that classifies individualized gyral folding networks without explicit node alignment. Cortical similarity networks are built from local morphometric features and represented by distributions of anonymized random walks, with an anatomy-aware encoding that preserves permutation invariance. Experiments on a large clinical cohort of AD and LBD subjects show consistent improvements over existing gyral folding and atlas-based models, demonstrating robustness and potential for dementia diagnosis.

61. 【2602.17556】Neural Implicit Representations for 3D Synthetic Aperture Radar Imaging

链接https://arxiv.org/abs/2602.17556

作者:Nithin Sugavanam,Emre Ertin

类目:ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

关键词:Synthetic aperture radar, spatial Fourier transform, Synthetic aperture, spatial Fourier, Fourier transform

备注

点击查看摘要

Abstract:Synthetic aperture radar (SAR) is a tomographic sensor that measures 2D slices of the 3D spatial Fourier transform of the scene. In many operational scenarios, the measured set of 2D slices does not fill the 3D space in the Fourier domain, resulting in significant artifacts in the reconstructed imagery. Traditionally, simple priors, such as sparsity in the image domain, are used to regularize the inverse problem. In this paper, we review our recent work that achieves state-of-the-art results in 3D SAR imaging employing neural structures to model the surface scattering that dominates SAR returns. These neural structures encode the surface of the objects in the form of a signed distance function learned from the sparse scattering data. Since estimating a smooth surface from a sparse and noisy point cloud is an ill-posed problem, we regularize the surface estimation by sampling points from the implicit surface representation during the training step. We demonstrate the model's ability to represent target scattering using measured and simulated data from single vehicles and a larger scene with a large number of vehicles. We conclude with future research directions calling for methods to learn complex-valued neural representations to enable synthesizing new collections from the volumetric neural implicit representation.