本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新612篇论文,其中:
- 自然语言处理102篇
- 信息检索15篇
- 计算机视觉115篇
自然语言处理
1. 【2602.04884】Reinforced Attention Learning
链接:https://arxiv.org/abs/2602.04884
作者:Bangzheng Li,Jianmo Ni,Chen Qu,Ian Miao,Liu Yang,Xingyu Fu,Muhao Chen,Derek Zhiyuan Cheng
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Large Language Models, reasoning in Large, Reinforcement Learning, substantially improved reasoning, Language Models
备注:
点击查看摘要
Abstract:Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
Subjects:
Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2602.04884 [cs.CL]
(or
arXiv:2602.04884v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.04884
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
2. 【2602.04879】Rethinking the Trust Region in LLM Reinforcement Learning
链接:https://arxiv.org/abs/2602.04879
作者:Penghui Qi,Xiangxin Zhou,Zichen Liu,Tianyu Pang,Chao Du,Min Lin,Wee Sun Lee
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, facto standard algorithm, Proximal Policy Optimization, fine-tuning Large Language
备注:
点击查看摘要
Abstract:Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
3. 【2602.04863】Subliminal Effects in Your Data: A General Mechanism via Log-Linearity
链接:https://arxiv.org/abs/2602.04863
作者:Ishaq Aden-Ali,Noah Golowich,Allen Liu,Abhishek Shetty,Ankur Moitra,Nika Haghtalab
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
关键词:Training modern large, modern large language, making it critical, Training modern, modern large
备注: Code available at [this https URL](https://github.com/ishaqadenali/logit-linear-selection)
点击查看摘要
Abstract:Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.
Comments:
Code available at this https URL
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as:
arXiv:2602.04863 [cs.LG]
(or
arXiv:2602.04863v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2602.04863
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
4. 【2602.04856】CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
链接:https://arxiv.org/abs/2602.04856
作者:Zhao Tong,Chunlin Gong,Yiping Zhang,Qiang Liu,Xingcheng Xu,Shu Wu,Haichao Shi,Xiao-Yu Zhang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, response signifies safe, signifies safe reasoning, refusal response signifies
备注: 28 pages, 35 figures
点击查看摘要
Abstract:From generating headlines to fabricating news, the Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures: stability, geometry, and energy to quantify how specific attention heads respond or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rise significantly when the thinking mode is activated, where the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new understanding perspective for mitigating latent reasoning risks.
5. 【2602.04853】Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"
链接:https://arxiv.org/abs/2602.04853
作者:Dhruv Madhwal,Lyuxin David Zhang,Dan Roth,Tomer Wolfson,Vivek Gupta
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, closed-book question answering, question answering, leading to confident
备注:
点击查看摘要
Abstract:Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.
6. 【2602.04816】Horizon-LM: A RAM-Centric Architecture for LLM Training
链接:https://arxiv.org/abs/2602.04816
作者:Zhengqing Yuan,Lichao Sun,Yanfang(Fanny)Ye
类目:Operating Systems (cs.OS); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:scale increasingly constrained, single-GPU hardware, large language models, outpaced the evolution, evolution of single-GPU
备注:
点击查看摘要
Abstract:The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5\,TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2$\times$ higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
7. 【2602.04811】SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization
链接:https://arxiv.org/abs/2602.04811
作者:Jiarui Yuan,Tailin Jin,Weize Chen,Zeyuan Liu,Zhiyuan Liu,Maosong Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:True self-evolution requires, solve future problems, True self-evolution, act as lifelong, lifelong learners
备注: Under review
点击查看摘要
Abstract:True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new'' knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at this https URL.
8. 【2602.04804】OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
链接:https://arxiv.org/abs/2602.04804
作者:Yue Ding,Yiyan Ji,Jungang Li,Xuyang Liu,Xinlong Chen,Junfei Wu,Bozhou Li,Bohan Zeng,Yang Shi,Yushuo Guan,Yuanxing Zhang,Jiaheng Liu,Qiang Liu,Pengfei Wan,Liang Wang
类目:Computation and Language (cs.CL)
关键词:Omni-modal Large Language, Large Language Models, Large Language, demonstrated strong capabilities, Omni-modal Large
备注: Code will be released soon
点击查看摘要
Abstract:Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
9. 【2602.04776】Speaker-Aware Simulation Improves Conversational Speech Recognition
链接:https://arxiv.org/abs/2602.04776
作者:Máté Gedeon,Péter Mihajlik
类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:Automatic speech recognition, speech remains challenging, remains challenging due, Automatic speech, conversational speech remains
备注:
点击查看摘要
Abstract:Automatic speech recognition (ASR) for conversational speech remains challenging due to the limited availability of large-scale, well-annotated multi-speaker dialogue data and the complex temporal dynamics of natural interactions. Speaker-aware simulated conversations (SASC) offer an effective data augmentation strategy by transforming single-speaker recordings into realistic multi-speaker dialogues. However, prior work has primarily focused on English data, leaving questions about the applicability to lower-resource languages. In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. We further propose C-SASC, an extended variant that incorporates pause modeling conditioned on utterance duration, enabling a more faithful representation of local temporal dependencies observed in human conversation while retaining the simplicity and efficiency of the original approach. We generate synthetic Hungarian dialogues from the BEA-Large corpus and combine them with real conversational data for ASR training. Both SASC and C-SASC are evaluated extensively under a wide range of simulation configurations, using conversational statistics derived from CallHome, BEA-Dialogue, and GRASS corpora. Experimental results show that speaker-aware conversational simulation consistently improves recognition performance over naive concatenation-based augmentation. While the additional duration conditioning in C-SASC yields modest but systematic gains--most notably in character-level error rates--its effectiveness depends on the match between source conversational statistics and the target domain. Overall, our findings confirm the robustness of speaker-aware conversational simulation for Hungarian ASR and highlight the benefits and limitations of increasingly detailed temporal modeling in synthetic dialogue generation.
10. 【2602.04764】Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation
链接:https://arxiv.org/abs/2602.04764
作者:Luis Frentzen Salim,Esteban Carlin,Alexandre Morinvil,Xi Ai,Lun-Wei Ku
类目:Computation and Language (cs.CL)
关键词:Building machine translation, Building machine, Large Language Models, notably difficult due, difficult due
备注: 8 pages, 18 figures, EACL 2026 Conference - LoResMT workshop
点击查看摘要
Abstract:Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstration at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English--target and Indonesian--target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.
11. 【2602.04755】When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
链接:https://arxiv.org/abs/2602.04755
作者:Xinyu Zhou,Chang Jin,Carsten Eickhoff,Zhijiang Guo,Seyed Ali Bahrainian
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, rarely admit uncertainty, misleading answers, rarely admit
备注: Accepted to ICLR2026
点击查看摘要
Abstract:Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
12. 【2602.04750】Exploiting contextual information to improve stance detection in informal political discourse with LLMs
链接:https://arxiv.org/abs/2602.04750
作者:Arman Engin Sucu,Yixiang Zhou,Mario A. Nascimento,Tony Mullen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Language Models, informal online discourse, Large Language, online discourse
备注: 14 pages, 7 figures
点击查看摘要
Abstract:This study investigates the use of Large Language Models (LLMs) for political stance detection in informal online discourse, where language is often sarcastic, ambiguous, and context-dependent. We explore whether providing contextual information, specifically user profile summaries derived from historical posts, can improve classification accuracy. Using a real-world political forum dataset, we generate structured profiles that summarize users' ideological leaning, recurring topics, and linguistic patterns. We evaluate seven state-of-the-art LLMs across baseline and context-enriched setups through a comprehensive cross-model evaluation. Our findings show that contextual prompts significantly boost accuracy, with improvements ranging from +17.5\% to +38.5\%, achieving up to 74\% accuracy that surpasses previous approaches. We also analyze how profile size and post selection strategies affect performance, showing that strategically chosen political content yields better results than larger, randomly selected contexts. These findings underscore the value of incorporating user-level context to enhance LLM performance in nuanced political classification tasks.
13. 【2602.04742】Inference-Time Reasoning Selectively Reduces Implicit Social Bias in Large Language Models
链接:https://arxiv.org/abs/2602.04742
作者:Molly Apsel,Michael N. Jones
类目:Computers and Society (cs.CY); Computation and Language (cs.CL)
关键词:Drawing on constructs, large language models, Implicit Association Test, identified a distinction, large language
备注:
点击查看摘要
Abstract:Drawing on constructs from psychology, prior work has identified a distinction between explicit and implicit bias in large language models (LLMs). While many LLMs undergo post-training alignment and safety procedures to avoid expressions of explicit social bias, they still exhibit significant implicit biases on indirect tasks resembling the Implicit Association Test (IAT). Recent work has further shown that inference-time reasoning can impair LLM performance on tasks that rely on implicit statistical learning. Motivated by a theoretical link between implicit associations and statistical learning in human cognition, we examine how reasoning-enabled inference affects implicit bias in LLMs. We find that enabling reasoning significantly reduces measured implicit bias on an IAT-style evaluation for some model classes across fifteen stereotype topics. This effect appears specific to social bias domains, as we observe no corresponding reduction for non-social implicit associations. As reasoning is increasingly enabled by default in deployed LLMs, these findings suggest that it can meaningfully alter fairness evaluation outcomes in some systems, while also raising questions about how alignment procedures interact with inference-time reasoning to drive variation in bias reduction across model types. More broadly, this work highlights how theory from cognitive science and psychology can complement AI evaluation research by providing methodological and interpretive frameworks that reveal new insights into model behavior.
14. 【2602.04739】Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases
链接:https://arxiv.org/abs/2602.04739
作者:Casey Ford,Madison Van Doren,Emily Dix
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:prompting remains underexplored, adversarial prompting remains, Claude Sonnet, real-world systems, remains underexplored
备注: under peer-review
点击查看摘要
Abstract:Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
15. 【2602.04735】From Data to Behavior: Predicting Unintended Model Behaviors Before Training
链接:https://arxiv.org/abs/2602.04735
作者:Mengru Wang,Zhenqian Xu,Junfeng Fang,Yunzhi Yao,Shumin Deng,Huajun Chen,Ningyu Zhang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, seemingly benign training, malicious content, Language Models
备注: Work in progress
点击查看摘要
Abstract:Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
16. 【2602.04731】Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging
链接:https://arxiv.org/abs/2602.04731
作者:Sameh Khattab,Jean-Philippe Corbeil,Osman Alperen Koraş,Amin Dada,Julian Friedrich,François Beaulieu,Paul Vozila,Jens Kleesiek
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:grounding Large Language, Large Language Models, Large Language, improving knowledge updates, Retrieval-augmented generation
备注: Preprint
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5\% (average 7.5\%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.
17. 【2602.04729】"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
链接:https://arxiv.org/abs/2602.04729
作者:Madison Van Doren,Casey Ford,Jennifer Barajas,Cory Holland
类目:Computation and Language (cs.CL)
关键词:large language models, machine translation produced, large-scale human evaluation, multilingual large language, assessing cultural localisation
备注: under peer-review
点击查看摘要
Abstract:We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but of ten overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
Comments:
under peer-review
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.04729 [cs.CL]
(or
arXiv:2602.04729v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.04729
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2602.04718】Identifying Intervenable and Interpretable Features via Orthogonality Regularization
链接:https://arxiv.org/abs/2602.04718
作者:Moritz Miller,Florent Draye,Bernhard Schölkopf
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:fixed sparse autoencoder, fine-tuning language models, sparse autoencoder, recent progress, progress on fine-tuning
备注:
点击查看摘要
Abstract:With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the $\textit{Independent Causal Mechanisms}$ principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under $\texttt{this https URL}$.
19. 【2602.04716】Linguistically Informed Evaluation of Multilingual ASR for African Languages
链接:https://arxiv.org/abs/2602.04716
作者:Fei-Yueh Chen,Lateef Adeleke,C.M. Downey
类目:Computation and Language (cs.CL)
关键词:Word Error Rate, mischaracterizes ASR models', ASR models' performance, mischaracterizes ASR, Error Rate
备注: To appear at AfricaNLP 2026
点击查看摘要
Abstract:Word Error Rate (WER) mischaracterizes ASR models' performance for African languages by combining phonological, tone, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models' performance. In this paper, we evaluate three speech encoders on two African languages by complementing WER with CER, and FER, and add a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically-salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential in metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly for Uneme (an endangered language absent from pretraining data) a model with near-total WER and 0.461 CER achieves the relatively low FER of 0.267. This indicates model error is often attributable to individual phonetic feature errors, which is obscured by all-or-nothing metrics like WER.
20. 【2602.04706】LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
链接:https://arxiv.org/abs/2602.04706
作者:Yike Sun,Haotong Yang,Zhouchen Lin,Muhan Zhang
类目:Computation and Language (cs.CL)
关键词:Tokenization is fundamental, language models represent, process text, architectures and training, represent and process
备注:
点击查看摘要
Abstract:Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning so that retained in the final vocabulary, but are mostly further merged and rarely emitted when tokenizing the corpus during tokenizer usage. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
21. 【2602.04705】ERNIE 5.0 Technical Report
链接:https://arxiv.org/abs/2602.04705
作者:Haifeng Wang,Hua Wu,Tian Wu,Yu Sun,Jing Liu,Dianhai Yu,Yanjun Ma,Jingzhou He,Zhongjun He,Dou Hong,Qiwen Liu,Shuohuan Wang,Junyuan Shang,Zhenyu Zhang,Yuchen Ding,Jinle Zeng,Jiabin Yang,Liang Shen,Ruibiao Chen,Weichong Yin,Siyu Ding,Dai Dai,Shikun Feng,Siqi Bao,Bolei He,Yan Chen,Zhenyu Jiao,Ruiqing Zhang,Zeyu Chen,Qingqing Dang,Kaipeng Deng,Jiajun Jiang,Enlei Gong,Guoxia Wang,Yanlin Sha,Yi Liu,Yehan Zheng,Weijian Xu,Jiaxiang Liu,Zengfeng Zeng,Yingqi Qu,Zhongli Li,Zhengkun Zhang,Xiyang Wang,Zixiang Xu,Xinchao Xu,Zhengjie Huang,Dong Wang,Bingjin Chen,Yue Chang,Xing Yuan,Shiwei Huang,Qiao Zhao,Xinzhe Ding,Shuangshuang Qiao,Baoshan Yang,Bihong Tang,Bin Li,Bingquan Wang,Binhan Tang,Binxiong Zheng,Bo Cui,Bo Ke,Bo Zhang,Bowen Zhang,Boyan Zhang,Boyang Liu,Caiji Zhang,Can Li,Chang Xu,Chao Pang,Chao Zhang,Chaoyi Yuan,Chen Chen,Cheng Cui,Chenlin Yin,Chun Gan,Chunguang Chai,Chuyu Fang,Cuiyun Han,Dan Zhang,Danlei Feng,Danxiang Zhu,Dong Sun,Dongbo Li,Dongdong Li,Dongdong Liu,Dongxue Liu,Fan Ding,Fan Hu,Fan Li,Fan Mo,Feisheng Wu,Fengwei Liu,Gangqiang Hu,Gaofeng Lu,Gaopeng Yong,Gexiao Tian,Guan Wang,Guangchen Ni
类目:Computation and Language (cs.CL)
关键词:introduce ERNIE, ERNIE, natively autoregressive foundation, foundation model desinged, modality-agnostic expert routing
备注:
点击查看摘要
Abstract:In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
22. 【2602.04693】LinGO: A Linguistic Graph Optimization Framework with LLMs for Interpreting Intents of Online Uncivil Discourse
链接:https://arxiv.org/abs/2602.04693
作者:Yuan Zhang,Thales Bertaglia
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Detecting uncivil language, democratic online spaces, Detecting uncivil, maintaining safe, crucial for maintaining
备注:
点击查看摘要
Abstract:Detecting uncivil language is crucial for maintaining safe, inclusive, and democratic online spaces. Yet existing classifiers often misinterpret posts containing uncivil cues but expressing civil intents, leading to inflated estimates of harmful incivility online. We introduce LinGO, a linguistic graph optimization framework for large language models (LLMs) that leverages linguistic structures and optimization techniques to classify multi-class intents of incivility that use various direct and indirect expressions. LinGO decomposes language into multi-step linguistic components, identifies targeted steps that cause the most errors, and iteratively optimizes prompt and/or example components for targeted steps. We evaluate it using a dataset collected during the 2022 Brazilian presidential election, encompassing four forms of political incivility: Impoliteness (IMP), Hate Speech and Stereotyping (HSST), Physical Harm and Violent Political Rhetoric (PHAVPR), and Threats to Democratic Institutions and Values (THREAT). Each instance is annotated with six types of civil/uncivil intent. We benchmark LinGO using three cost-efficient LLMs: GPT-5-mini, Gemini 2.5 Flash-Lite, and Claude 3 Haiku, and four optimization techniques: TextGrad, AdalFlow, DSPy, and Retrieval-Augmented Generation (RAG). The results show that, across all models, LinGO consistently improves accuracy and weighted F1 compared with zero-shot, chain-of-thought, direct optimization, and fine-tuning baselines. RAG is the strongest optimization technique and, when paired with Gemini model, achieves the best overall performance. These findings demonstrate that incorporating multi-step linguistic components into LLM instructions and optimize targeted components can help the models explain complex semantic meanings, which can be extended to other complex semantic explanation tasks in the future.
23. 【2602.04687】Investigating Disability Representations in Text-to-Image Models
链接:https://arxiv.org/abs/2602.04687
作者:Yang Yian,Yu Fan,Liudmila Zavolokina,Sarah Ebling
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:represent social groups, made remarkable progress, producing high-quality visual, high-quality visual content, textual descriptions
备注: 21 pages, 9 figures. References included
点击查看摘要
Abstract:Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.
24. 【2602.04680】Audio ControlNet for Fine-Grained Audio Generation and Editing
链接:https://arxiv.org/abs/2602.04680
作者:Haina Zhu,Yao Xiao,Xiquan Li,Ziyang Ma,Jianwei Yu,Bowen Zhang,Mingqi Yang,Xie Chen
类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:study the fine-grained, generation task, models, pitch, Abstract
备注:
点击查看摘要
Abstract:We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that the T2A-Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
25. 【2602.04674】Overstating Attitudes, Ignoring Networks: LLM Biases in Simulating Misinformation Susceptibility
链接:https://arxiv.org/abs/2602.04674
作者:Eun Cheol Choi,Lindsay E. Young,Emilio Ferrara
类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, computational social science, misinformation remains unclear, Large language, remains unclear
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used as proxies for human judgment in computational social science, yet their ability to reproduce patterns of susceptibility to misinformation remains unclear. We test whether LLM-simulated survey respondents, prompted with participant profiles drawn from social survey data measuring network, demographic, attitudinal and behavioral features, can reproduce human patterns of misinformation belief and sharing. Using three online surveys as baselines, we evaluate whether LLM outputs match observed response distributions and recover feature-outcome associations present in the original survey data. LLM-generated responses capture broad distributional tendencies and show modest correlation with human responses, but consistently overstate the association between belief and sharing. Linear models fit to simulated responses exhibit substantially higher explained variance and place disproportionate weight on attitudinal and behavioral features, while largely ignoring personal network characteristics, relative to models fit to human responses. Analyses of model-generated reasoning and LLM training data suggest that these distortions reflect systematic biases in how misinformation-related concepts are represented. Our findings suggest that LLM-based survey simulations are better suited for diagnosing systematic divergences from human judgment than for substituting it.
26. 【2602.04669】Delving into Muon and Beyond: Deep Analysis and Extensions
链接:https://arxiv.org/abs/2602.04669
作者:Xianbiao Qi,Marco Chen,Jiaquan Ye,Yelin He,Rong Xiao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:recently attracted considerable, attracted considerable attention, Adam remain insufficiently, strong empirical performance, remain insufficiently understood
备注: This paper studies matrix-based optimizers (e.g., Muon) from a spectral perspective and unifies a range of methods under a common spectral framework
点击查看摘要
Abstract:The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix-shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form U \boldsymbol{\Sigma}^{p} V' , and consider additional variants with p = 1/2 , p = 1/4 , and p = 1 . These transformations are applied to both first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at this https URL.
27. 【2602.04659】Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers
链接:https://arxiv.org/abs/2602.04659
作者:Lukas Radosky,Miroslav Blstak,Matej Krajcovic,Ivan Polasek
类目:Computation and Language (cs.CL)
关键词:Semantic textual similarity, language processing tasks, Semantic textual, natural language processing, textual similarity
备注: This is a preprint of a paper that was presented at the IEEE 24th World Symposium on Applied Machine Intelligence and Informatics (SAMI 2026)
点击查看摘要
Abstract:Semantic textual similarity (STS) plays a crucial role in many natural language processing tasks. While extensively studied in high-resource languages, STS remains challenging for under-resourced languages such as Slovak. This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak, including traditional algorithms, supervised machine learning models, and third-party deep learning tools. We trained several machine learning models using outputs from traditional algorithms as features, with feature selection and hyperparameter tuning jointly guided by artificial bee colony optimization. Finally, we evaluated several third-party tools, including fine-tuned model by CloudNLP, OpenAI's embedding models, GPT-4 model, and pretrained SlovakBERT model. Our findings highlight the trade-offs between different approaches.
28. 【2602.04649】Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
链接:https://arxiv.org/abs/2602.04649
作者:Binghai Wang,Yantao Liu,Yuxuan Liu,Tianyi Tang,Shenzhi Wang,Chang Gao,Chujie Zheng,Yichang Zhang,Le Yu,Shixuan Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Bowen Yu,Fei Huang,Junyang Lin
类目:Computation and Language (cs.CL)
关键词:Generative Reward Models, Generative Reward, producing correct judgments, Reward Models, prioritize Outcome Accuracy
备注:
点击查看摘要
Abstract:Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. Using RM during RLHF, our method effectively improves performance as demonstrated on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
29. 【2602.04630】Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings
链接:https://arxiv.org/abs/2602.04630
作者:Tim Kunt,Annika Buchholz,Imene Khebouri,Thorsten Koch,Ida Litzel,Thi Huong Vu
类目:Computation and Language (cs.CL)
关键词:Large text data, text data sets, data sets, text-based media, inherit two distinct
备注:
点击查看摘要
Abstract:Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and can be handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
30. 【2602.04617】LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
链接:https://arxiv.org/abs/2602.04617
作者:Ruixiao Yang,Yuanhe Tian,Xu Yang,Huiqi Li,Yan Song
类目:Computation and Language (cs.CL)
关键词:Radiology Report Generation, Radiology Report, aims to produce, medical images, produce accurate
备注:
点击查看摘要
Abstract:Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostics from medical images. Although large vision language models (LVLM) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate the alignment between generated text and visual information. However, these approaches often ignore the inherent decoding priors and vision-language alignment biases in pretrained models and lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method to inherently modify the LVLM decoding trajectory. A multiple experts module is designed for extracting distinct pathological features which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering the generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that the LEAD method yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
31. 【2602.04613】Disentangling meaning from language in LLM-based machine translation
链接:https://arxiv.org/abs/2602.04613
作者:Théo Lasnier,Armel Zebaze,Djamé Seddah,Rachel Bawden,Benoît Sagot
类目:Computation and Language (cs.CL)
关键词:neural networks implement, Large Language Models, Mechanistic Interpretability, scale of Large, work in Machine
备注: 61 pages, 70 figures
点击查看摘要
Abstract:Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence's meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
32. 【2602.04607】Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection
链接:https://arxiv.org/abs/2602.04607
作者:Junhao Liu,Haonan Yu,Zhenyu Yan,Xin Zhang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, massive context windows, achieving surgical feature-level, handle massive context
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) scale to handle massive context windows, achieving surgical feature-level interpretation is essential for high-stakes tasks like legal auditing and code debugging. However, existing local model-agnostic explanation methods face a critical dilemma in these scenarios: feature-based methods suffer from attribution dilution due to high feature dimensionality, thus failing to provide faithful explanations. In this paper, we propose Focus-LIME, a coarse-to-fine framework designed to restore the tractability of surgical interpretation. Focus-LIME utilizes a proxy model to curate the perturbation neighborhood, allowing the target model to perform fine-grained attribution exclusively within the optimized context. Empirical evaluations on long-context benchmarks demonstrate that our method makes surgical explanations practicable and provides faithful explanations to users.
33. 【2602.04605】RexBERT: Context Specialized Bidirectional Encoders for E-commerce
链接:https://arxiv.org/abs/2602.04605
作者:Rahul Bajaj,Anuj Garg
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Encoder-only transformers remain, transformers remain indispensable, Encoder-only transformers, indispensable in retrieval, systems where latency
备注: Blog: [this https URL](https://huggingface.co/blog/thebajajra/rexbert-encoders) Models: [this https URL](https://huggingface.co/collections/thebajajra/rexbert) Ecom-niverse Dataset: [this https URL](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
点击查看摘要
Abstract:Encoder-only transformers remain indispensable in retrieval, classification, and ranking systems where latency, stability, and cost are paramount. Most general purpose encoders, however, are trained on generic corpora with limited coverage of specialized domains. We introduce RexBERT, a family of BERT-style encoders designed specifically for e-commerce semantics. We make three contributions. First, we release Ecom-niverse, a 350 billion token corpus curated from diverse retail and shopping sources. We describe a modular pipeline that isolates and extracts e-commerce content from FineFineWeb and other open web resources, and characterize the resulting domain distribution. Second, we present a reproducible pretraining recipe building on ModernBERT's architectural advances. The recipe consists of three phases: general pre-training, context extension, and annealed domain specialization. Third, we train RexBERT models ranging from 17M to 400M parameters and evaluate them on token classification, semantic similarity, and general natural language understanding tasks using e-commerce datasets. Despite having 2-3x fewer parameters, RexBERT outperforms larger general-purpose encoders and matches or surpasses modern long-context models on domain-specific benchmarks. Our results demonstrate that high quality in-domain data combined with a principled training approach provides a stronger foundation for e-commerce applications than indiscriminate scaling alone.
34. 【2602.04604】Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays
链接:https://arxiv.org/abs/2602.04604
作者:Lucile Favero,Juan Antonio Pérez-Ortiz,Tanja Käser,Nuria Oliver
类目:Computation and Language (cs.CL)
关键词:Automated Essay Scoring, complex essay genres, Automated Essay, Argumentative Essay Scoring, Essay Scoring
备注:
点击查看摘要
Abstract:Automated Essay Scoring systems have traditionally focused on holistic scores, limiting their pedagogical usefulness, especially in the case of complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model with a CORAL-style ordinal regression formulation, optimized for long-sequence understanding. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits, offering strong coverage of core argumentation dimensions. LLMs are prompted with designed, rubric-aligned in-context examples, along with feedback and confidence requests, while we explicitly model ordinality in scores with the BigBird model via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming LLMs and nominal classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve a competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.
35. 【2602.04587】VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
链接:https://arxiv.org/abs/2602.04587
作者:Jaeyoon Jung,Yejun Yoon,Seunghyun Yoon,Kunwoo Park
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:paper describes VILLAIN, prompt-based multi-agent collaboration, describes VILLAIN, multimodal fact-checking system, multi-agent collaboration
备注: A system description paper for the AVerImaTeC shared task at the Ninth FEVER Workshop (co-located with EACL 2026)
点击查看摘要
Abstract:This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at this https URL.
36. 【2602.04581】rust The Typical
链接:https://arxiv.org/abs/2602.04581
作者:Debargha Ganguly,Sreehari Sankar,Biyao Zhang,Vikash Singh,Kanan Gupta,Harshini Kavuru,Alan Luo,Weicong Chen,Warren Morningstar,Raghu Machiraju,Vipin Chaudhary
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
关键词:LLM safety fundamentally, Current approaches, safety fundamentally rely, approaches to LLM, game of identifying
备注:
点击查看摘要
Abstract:Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
37. 【2602.04579】AIANO: Enhancing Information Retrieval with AI-Augmented Annotation
链接:https://arxiv.org/abs/2602.04579
作者:Sameh Khattab,Marie Bauer,Lukas Heine,Till Rostalski,Jens Kleesiek,Julian Friedrich
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, Retrieval-Augmented Generation, rise of Large
备注:
点击查看摘要
Abstract:The rise of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has rapidly increased the need for high-quality, curated information retrieval datasets. These datasets, however, are currently created with off-the-shelf annotation tools that make the annotation process complex and inefficient. To streamline this process, we developed a specialized annotation tool - AIANO. By adopting an AI-augmented annotation workflow that tightly integrates human expertise with LLM assistance, AIANO enables annotators to leverage AI suggestions while retaining full control over annotation decisions. In a within-subject user study ($n = 15$), participants created question-answering datasets using both a baseline tool and AIANO. AIANO nearly doubled annotation speed compared to the baseline while being easier to use and improving retrieval accuracy. These results demonstrate that AIANO's AI-augmented approach accelerates and enhances dataset creation for information retrieval tasks, advancing annotation capabilities in retrieval-intensive domains.
38. 【2602.04577】Semantic Self-Distillation for Language Model Uncertainty
链接:https://arxiv.org/abs/2602.04577
作者:Edward Phillips,Sean Wu,Boyan Gao,David A. Clifton
类目:Computation and Language (cs.CL)
关键词:Large language models, principled uncertainty quantification, models present challenges, Large language, present challenges
备注:
点击查看摘要
Abstract:Large language models present challenges for principled uncertainty quantification, in part due to their complexity and the diversity of their outputs. Semantic dispersion, or the variance in the meaning of sampled answers, has been proposed as a useful proxy for model uncertainty, but the associated computational cost prohibits its use in latency-critical applications. We show that sampled semantic distributions can be distilled into lightweight student models which estimate a prompt-conditioned uncertainty before the language model generates an answer token. The student model predicts a semantic distribution over possible answers; the entropy of this distribution provides an effective uncertainty signal for hallucination prediction, and the probability density allows candidate answers to be evaluated for reliability. On TriviaQA, our student models match or outperform finite-sample semantic dispersion for hallucination prediction and provide a strong signal for out-of-domain answer detection. We term this technique Semantic Self-Distillation (SSD), which we suggest provides a general framework for distilling predictive uncertainty in complex output spaces beyond language.
39. 【2602.04570】Can LLMs capture stable human-generated sentence entropy measures?
链接:https://arxiv.org/abs/2602.04570
作者:Estrella Pivel-Villanueva,Elisabeth Frederike Sterner,Franziska Knolle
类目:Computation and Language (cs.CL)
关键词:Predicting upcoming words, Predicting upcoming, quantified using Shannon, Shannon entropy, entropy
备注:
点击查看摘要
Abstract:Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy remains unclear. Here, we address both issues using two large publicly available cloze datasets in German 1 and English 2. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged after 111 responses in German and 81 responses in English, while low-entropy sentences (1) required as few as 20 responses and high-entropy sentences (2.5) substantially more. These findings provide the first direct empirical validation for common norming practices and demonstrate that convergence critically depends on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs, including GPT-4o, using both logit-based probability extraction and sampling-based frequency estimation, GPT2-xl/german-GPT-2, RoBERTa Base/GottBERT, and LLaMA 2 7B Chat. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates were better in capturing the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.
40. 【2602.04557】xtual Planning with Explicit Latent Transitions
链接:https://arxiv.org/abs/2602.04557
作者:Eliezer Shlomi,Ido Levy,Eilam Shapira,Michael Katz,Guy Uziel,Segev Shlomov,Nir Mashkif,Roi Reichart,Sarah Keren
类目:Computation and Language (cs.CL)
关键词:full forward passes, making multi-step lookahead, repeated full forward, rollout-based search expensive, forward passes
备注:
点击查看摘要
Abstract:Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates generalization to alternative plans rather than memorizing seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain's transitions, while transfer across domain boundaries remains a bottleneck.
41. 【2602.04556】Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates
链接:https://arxiv.org/abs/2602.04556
作者:Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:compact language models, compact language, Weight tying, weight sharing, hidden states
备注: an early-stage version
点击查看摘要
Abstract:Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, weight sharing does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and making post-training interventions such as editing, patching, and lightweight adaptation less predictable. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by thin polar decomposition for teacher initialization or random orthonormal initialization from scratch, and introduces a fully learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and any vocabulary-sized auxiliary parameters. We evaluate PIT on on-device models spanning 256M-1.3B parameters across pretraining and adaptation, and consistently observe improved training stability, stronger layerwise semantic consistency, and substantially reduced side effects.
42. 【2602.04546】Unmasking Superspreaders: Data-Driven Approaches for Identifying and Comparing Key Influencers of Conspiracy Theories on X.com
链接:https://arxiv.org/abs/2602.04546
作者:Florian Kramer,Henrich R. Greve,Moritz von Zahn,Hayagreeva Rao
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:Human Superspreaders, deepening polarization, spreading misinformation, democratic institutions, Superspreaders
备注:
点击查看摘要
Abstract:Conspiracy theories can threaten society by spreading misinformation, deepening polarization, and eroding trust in democratic institutions. Social media often fuels the spread of conspiracies, primarily driven by two key actors: Superspreaders -- influential individuals disseminating conspiracy content at disproportionately high rates, and Bots -- automated accounts designed to amplify conspiracies strategically. To counter the spread of conspiracy theories, it is critical to both identify these actors and to better understand their behavior. However, a systematic analysis of these actors as well as real-world-applicable identification methods are still lacking. In this study, we leverage over seven million tweets from the COVID-19 pandemic to analyze key differences between Human Superspreaders and Bots across dimensions such as linguistic complexity, toxicity, and hashtag usage. Our analysis reveals distinct communication strategies: Superspreaders tend to use more complex language and substantive content while relying less on structural elements like hashtags and emojis, likely to enhance credibility and authority. By contrast, Bots favor simpler language and strategic cross-usage of hashtags, likely to increase accessibility, facilitate infiltration into trending discussions, and amplify reach. To counter both Human Superspreaders and Bots, we propose and evaluate 27 novel metrics for quantifying the severity of conspiracy theory spread. Our findings highlight the effectiveness of an adapted H-Index for computationally feasible identification of Human Superspreaders. By identifying behavioral patterns unique to Human Superspreaders and Bots as well as providing suitable identification methods, this study provides a foundation for mitigation strategies, including platform moderation policies, temporary and permanent account suspensions, and public awareness campaigns.
43. 【2602.04541】LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
链接:https://arxiv.org/abs/2602.04541
作者:Gang Lin,Dongfang Li,Zhuoen Chen,Yukun Shi,Xuhui Chen,Baotian Hu,Min Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:rapidly expanding key-value, expanding key-value cache, imposes heavy memory, large language models, long-context large language
备注: ICLR 2026
点击查看摘要
Abstract:The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
44. 【2602.04540】PersoPilot: An Adaptive AI-Copilot for Transparent Contextualized Persona Classification and Personalized Response Generation
链接:https://arxiv.org/abs/2602.04540
作者:Saleh Afzoon,Amin Beheshti,Usman Naseem
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:delivering effective personalization, critical for delivering, delivering effective, classifying user personas, classifying user
备注: Accepted for the Demo Track at the IEEE International Conference on Data Mining (ICDM) 2025
点击查看摘要
Abstract:Understanding and classifying user personas is critical for delivering effective personalization. While persona information offers valuable insights, its full potential is realized only when contextualized, linking user characteristics with situational context to enable more precise and meaningful service provision. Existing systems often treat persona and context as separate inputs, limiting their ability to generate nuanced, adaptive interactions. To address this gap, we present PersoPilot, an agentic AI-Copilot that integrates persona understanding with contextual analysis to support both end users and analysts. End users interact through a transparent, explainable chat interface, where they can express preferences in natural language, request recommendations, and receive information tailored to their immediate task. On the analyst side, PersoPilot delivers a transparent, reasoning-powered labeling assistant, integrated with an active learning-driven classification process that adapts over time with new labeled data. This feedback loop enables targeted service recommendations and adaptive personalization, bridging the gap between raw persona data and actionable, context-aware insights. As an adaptable framework, PersoPilot is applicable to a broad range of service personalization scenarios.
45. 【2602.04521】$C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal
链接:https://arxiv.org/abs/2602.04521
作者:Aditya Kasliwal,Pratinav Seth,Vinay Kumar Sankarapu
类目:Computation and Language (cs.CL); Emerging Technologies (cs.ET)
关键词:Modern deployments require, enforce safety policies, deployments require LLMs, Modern deployments, add recurring compute
备注:
点击查看摘要
Abstract:Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-{\Delta}{\theta}: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update {\Delta}{\theta}C supported only on that circuit (typically 5% of parameters). Applying {\Delta}{\theta}C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
46. 【2602.04514】ReFRAME or Remain: Unsupervised Lexical Semantic Change Detection with Frame Semantics
链接:https://arxiv.org/abs/2602.04514
作者:Bach Phan-Tat,Kris Heylen,Dirk Geeraerts,Stefano De Pascale,Dirk Speelman
类目:Computation and Language (cs.CL)
关键词:embedding distributional representations, contemporary computational methods, neural embedding distributional, detection are based, majority of contemporary
备注:
点击查看摘要
Abstract:The majority of contemporary computational methods for lexical semantic change (LSC) detection are based on neural embedding distributional representations. Although these models perform well on LSC benchmarks, their results are often difficult to interpret. We explore an alternative approach that relies solely on frame semantics. We show that this method is effective for detecting semantic change and can even outperform many distributional semantic models. Finally, we present a detailed quantitative and qualitative analysis of its predictions, demonstrating that they are both plausible and highly interpretable
47. 【2602.04509】Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
链接:https://arxiv.org/abs/2602.04509
作者:Hyeontaek Hwang,Nguyen Dinh Son,Daeyoung Kim
类目:Computation and Language (cs.CL)
关键词:Multimodal Large Language, Fine-tuning Multimodal Large, Multimodal Large, Large Language Models, Large Language
备注:
点击查看摘要
Abstract:Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
48. 【2602.04493】PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation
链接:https://arxiv.org/abs/2602.04493
作者:Saleh Afzoon,MohammadHossein Ahmadi,Usman Naseem,Amin Beheshti
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:building effective persona-grounded, persona-grounded dialogue systems, effective persona-grounded dialogue, essential components, components in building
备注: Accepted at WISE 2025 Conference
点击查看摘要
Abstract:Personalization and contextual coherence are two essential components in building effective persona-grounded dialogue systems. These aspects play a crucial role in enhancing user engagement and ensuring responses are more relevant and consistent with user identity. However, recent studies indicate that open-source large language models (LLMs) continue to struggle to generate responses that are both contextually grounded and aligned with persona cues, despite exhibiting strong general conversational abilities like fluency and naturalness. We present PersoDPO, a scalable preference optimisation framework that uses supervision signals from automatic evaluations of responses generated by both closed-source and open-source LLMs to fine-tune dialogue models. The framework integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature to promote instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, enabling a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.
49. 【2602.04489】Deconstructing sentence disambiguation by joint latent modeling of reading paradigms: LLM surprisal is not enough
链接:https://arxiv.org/abs/2602.04489
作者:Dario Paape,Tal Linzen,Shravan Vasishth
类目:Computation and Language (cs.CL)
关键词:ambiguous garden-path sentences, temporarily ambiguous garden-path, bidirectional self-paced reading, eye tracking, striker wondered
备注:
点击查看摘要
Abstract:Using temporarily ambiguous garden-path sentences ("While the team trained the striker wondered ...") as a test case, we present a latent-process mixture model of human reading behavior across four different reading paradigms (eye tracking, uni- and bidirectional self-paced reading, Maze). The model distinguishes between garden-path probability, garden-path cost, and reanalysis cost, and yields more realistic processing cost estimates by taking into account trials with inattentive reading. We show that the model is able to reproduce empirical patterns with regard to rereading behavior, comprehension question responses, and grammaticality judgments. Cross-validation reveals that the mixture model also has better predictive fit to human reading patterns and end-of-trial task data than a mixture-free model based on GPT-2-derived surprisal values. We discuss implications for future work.
50. 【2602.04486】Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition
链接:https://arxiv.org/abs/2602.04486
作者:Jinlong Ma,Yu Zhang,Xuefeng Bai,Kehai Chen,Yuwei Wang,Zeming Liu,Jun Yu,Min Zhang
类目:Computation and Language (cs.CL)
关键词:Named Entity Recognition, Grounded Multimodal Named, Multimodal Named Entity, Entity Recognition, Named Entity
备注: GMNER
点击查看摘要
Abstract:Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit $\textbf{modality bias}$, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning ($\textbf{MCR}$), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
51. 【2602.04466】Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks
链接:https://arxiv.org/abs/2602.04466
作者:Masaya Tsunokake,Yuta Koreeda,Terufumi Morishita,Koichi Nagatsuka,Hikaru Tomonari,Yasuhiro Sogawa
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:textbf, real-world enterprise operations, small domains, applying LLMs, mDAPT
备注: 13 pages, 9 figures, Accepted by EACL2026 Industry Track
点击查看摘要
Abstract:When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations ($\textbf{micro domains}$). A previous study shows micro domain-adaptive pre-training ($\textbf{mDAPT}$) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) $\textbf{eliciting}$ facts relevant to questions from an LLM's own knowledge, (2) $\textbf{reasoning}$ over the facts to obtain conclusions, and (3) $\textbf{composing}$ long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT's effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.
52. 【2602.04456】Growth First, Care Second? Tracing the Landscape of LLM Value Preferences in Everyday Dilemmas
链接:https://arxiv.org/abs/2602.04456
作者:Zhiyi Chen,Eun Cheol Choi,Yingjia Luo,Xinyi Wang,Yulei Xiao,Aizi Yang,Luca Luceri
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:People increasingly seek, People increasingly, increasingly seek advice, seek advice online, large language model
备注: dataset available at [this https URL](https://github.com/Renesmeeczy/Value-Trade-off-in-Reddit-Dilemmas)
点击查看摘要
Abstract:People increasingly seek advice online from both human peers and large language model (LLM)-based chatbots. Such advice rarely involves identifying a single correct answer; instead, it typically requires navigating trade-offs among competing values. We aim to characterize how LLMs navigate value trade-offs across different advice-seeking contexts. First, we examine the value trade-off structure underlying advice seeking using a curated dataset from four advice-oriented subreddits. Using a bottom-up approach, we inductively construct a hierarchical value framework by aggregating fine-grained values extracted from individual advice options into higher-level value categories. We construct value co-occurrence networks to characterize how values co-occur within dilemmas and find substantial heterogeneity in value trade-off structures across advice-seeking contexts: a women-focused subreddit exhibits the highest network density, indicating more complex value conflicts; women's, men's, and friendship-related subreddits exhibit highly correlated value-conflict patterns centered on security-related tensions (security vs. respect/connection/commitment); by contrast, career advice forms a distinct structure where security frequently clashes with self-actualization and growth. We then evaluate LLM value preferences against these dilemmas and find that, across models and contexts, LLMs consistently prioritize values related to Exploration Growth over Benevolence Connection. This systemically skewed value orientation highlights a potential risk of value homogenization in AI-mediated advice, raising concerns about how such systems may shape decision-making and normative outcomes at scale.
53. 【2602.04442】No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data
链接:https://arxiv.org/abs/2602.04442
作者:Dmitry Karpov
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Turkic language pairs, explore machine translation, Turkic language, language pairs, achieved chrF
备注: Accepted to EACL 2026 (LoResMT workshop)
点击查看摘要
Abstract:We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
54. 【2602.04428】Fine-Grained Activation Steering: Steering Less, Achieving More
链接:https://arxiv.org/abs/2602.04428
作者:Zijian Feng,Tianjiao Li,Zixiao Zhu,Hanzhang Zhou,Junlang Qian,Li Zhang,Jia Jim Deryl Chua,Lee Onn Mak,Gee Wah Ng,Kezhi Mao
类目:Computation and Language (cs.CL)
关键词:large language model, modifying large language, language model, cost-effective paradigm, paradigm for modifying
备注: ICLR 2026
点击查看摘要
Abstract:Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analysis show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.
55. 【2602.04413】History-Guided Iterative Visual Reasoning with Self-Correction
链接:https://arxiv.org/abs/2602.04413
作者:Xinglong Yang,Zhilin Peng,Zhanzhan Liu,Haochen Shi,Sheng-Jun Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:multimodal large language, large language models, Self-consistency methods, core technique, reliability of multimodal
备注:
点击查看摘要
Abstract:Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed ``repeated sampling and voting'' paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using \texttt{Llama3.2-vision:11b} on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90\%, representing a 107\% improvement over the baseline.
56. 【2602.04399】Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models
链接:https://arxiv.org/abs/2602.04399
作者:Yu Zhang,Xinchen Li,Jialei Zhou,Hongnan Ma,Zhongwei Wan,Yiwei Shi,Duoqian Miao,Qi Zhang,Longbing Cao
类目:Computation and Language (cs.CL)
关键词:diffusion language models, combining inter-block sequential, inter-block sequential denoising, decoding effectively improves, Block-wise decoding effectively
备注:
点击查看摘要
Abstract:Block-wise decoding effectively improves the inference speed and quality in diffusion language models (DLMs) by combining inter-block sequential denoising and intra-block parallel unmasking. However, existing block-wise decoding methods typically partition blocks in a rigid and fixed manner, which inevitably fragments complete semantic or syntactic constituents, leading to suboptimal performance. Inspired by the entropy reduction hypothesis (ERH), we recognize that constituent boundaries offer greater opportunities for uncertainty reduction, which motivates us to employ entropy analysis for identifying constituent boundaries. Therefore, we propose Swordsman, an entropy-driven adaptive block-wise decoding framework for DLMs. Swordsman adaptively partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. In addition, Swordsman dynamically adjusts unmasking thresholds conditioned on the real-time unmasking status within a block, further improving both efficiency and stability. As a training-free framework, supported by KV Cache, Swordsman demonstrates state-of-the-art performance across extensive evaluations.
57. 【2602.04398】Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts
链接:https://arxiv.org/abs/2602.04398
作者:Yujie Lin,Kunquan Li,Yixuan Liao,Xiaoxin Chen,Jinsong Su
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:language processing tasks, Large language models, natural language processing, demonstrated impressive capabilities, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in LLMs, without the need for fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at the github link: this https URL.
58. 【2602.04392】Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models
链接:https://arxiv.org/abs/2602.04392
作者:Isabel Tsintsiper,Sheng Wong,Beth Albert,Shaun P Brennecke,Gabriel Davis Jones
类目:Computation and Language (cs.CL)
关键词:clinical decision support, Large language models, workflows for documentation, decision support, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. Four general-purpose LLMs (ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeekchat). All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning a female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
59. 【2602.04391】Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning
链接:https://arxiv.org/abs/2602.04391
作者:Jie Deng,Hanshuang Tong,Jun Li,Shining Liang,Ning Wu,Hongzhi Li,Yutao Xie
类目:Computation and Language (cs.CL)
关键词:Large language models, made impressive strides, Large language, made impressive, impressive strides
备注:
点击查看摘要
Abstract:Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
60. 【2602.04355】Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models
链接:https://arxiv.org/abs/2602.04355
作者:Sichu Liang,Hongyu Zhu,Wenwen Wang,Deyu Zhou
类目:Computation and Language (cs.CL)
关键词:updating task-relevant information, providing a dynamic, central component, component of intelligent, dynamic workspace
备注:
点击查看摘要
Abstract:Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
61. 【2602.04326】From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents
链接:https://arxiv.org/abs/2602.04326
作者:SeungWon Seo,SooBin Lim,SeongRae Noh,Haneul Kim,HyeongYeop Kang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:Embodied agents operating, Embodied agents, Large Language Models, applying Large Language, partially observable
备注: 31 pages, 10 figures, Accepted ICLR 2026
点击查看摘要
Abstract:Embodied agents operating in multi-agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators' intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long-standing challenges, such as high-level goal decomposition and online adaptation. Yet, uncertainty is still primarily mitigated through frequent inter-agent communication. This incurs substantial token and time costs, and can disrupt established workflows, when human partners are involved. We introduce PCE, a Planner-Composer-Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal-directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi-agent benchmarks (C-WAH and TDW-MAT) and three diverse LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning-depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty-aware planning.
62. 【2602.04320】A Domain-Specific Curated Benchmark for Entity and Document-Level Relation Extraction
链接:https://arxiv.org/abs/2602.04320
作者:Marco Martinelli,Stefano Marchesin,Vanessa Bonato,Giorgio Maria Di Nunzio,Nicola Ferro,Ornella Irrera,Laura Menotti,Federica Vezzani,Gianmaria Silvello
类目:Computation and Language (cs.CL)
关键词:Named Entity Recognition, Named Entity Linking, encompassing Named Entity, Named Entity, Information Extraction
备注: Accepted to EACL 2026
点击查看摘要
Abstract:Information Extraction (IE), encompassing Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE), is critical for transforming the rapidly growing volume of scientific publications into structured, actionable knowledge. This need is especially evident in fast-evolving biomedical fields such as the gut-brain axis, where research investigates complex interactions between the gut microbiota and brain-related disorders. Existing biomedical IE benchmarks, however, are often narrow in scope and rely heavily on distantly supervised or automatically generated annotations, limiting their utility for advancing robust IE methods. We introduce GutBrainIE, a benchmark based on more than 1,600 PubMed abstracts, manually annotated by biomedical and terminological experts with fine-grained entities, concept-level links, and relations. While grounded in the gut-brain axis, the benchmark's rich schema, multiple tasks, and combination of highly curated and weakly supervised data make it broadly applicable to the development and evaluation of biomedical IE systems across domains.
63. 【2602.04306】DeFrame: Debiasing Large Language Models Against Framing Effects
链接:https://arxiv.org/abs/2602.04306
作者:Kahee Lim,Soyeon Kim,Steven Euijong Whang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, language models, real-world applications, large language, increasingly deployed
备注: 40 pages, 12 figures
点击查看摘要
Abstract:As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.
64. 【2602.04304】Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement
链接:https://arxiv.org/abs/2602.04304
作者:Zipeng Zhu,Zhanghao Hu,Qinglin Zhu,Yuxi Hong,Yijun Liu,Jingyong Su,Yulan He,Lin Gui
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, text embedding space, uniform pretraining resolution, fixed visual-token budget, visual-token budget forces
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
65. 【2602.04297】Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification
链接:https://arxiv.org/abs/2602.04297
作者:Branislav Pecher,Michal Spiegel,Robert Belanec,Jan Cegin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, few-shot classifiers, controlled through prompting, Large language, zero-shot and few-shot
备注:
点击查看摘要
Abstract:Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, the investigation of sensitivity is performed using underspecified prompts that provide minimal task instructions and weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction-prompts suffer less from such problems. However, linear probing analysis suggests that the effects of prompt underspecification have only a marginal impact on the internal LLM representations, instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.
66. 【2602.04294】How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks
链接:https://arxiv.org/abs/2602.04294
作者:Yanshu Wang,Shuaishuai Yang,Jingjing He,Tong Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:Large Language Models, Large Language, Language Models, face increasing threats, bypass safety alignment
备注: 13 pages, 4 figures, 6 tables
点击查看摘要
Abstract:Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.
67. 【2602.04290】Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision
链接:https://arxiv.org/abs/2602.04290
作者:Lingzhuang Sun,Ruitong Liu,Yuxia Zhu,Xiaohan Xu,Jingxuan Wei,Xiangxiang Zhang,Bihui Yu,Wentao Zhang
类目:Computation and Language (cs.CL)
关键词:Multimodal Large Language, Large Language Models, Reinforcement Learning, Large Language, complex reasoning capabilities
备注:
点击查看摘要
Abstract:Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the \textbf{Guided Verifier} framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing \textbf{CoRe} dataset of process-level negatives and \textbf{Co}rrect-guide \textbf{Re}asoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
68. 【2602.04289】Proxy Compression for Language Modeling
链接:https://arxiv.org/abs/2602.04289
作者:Lin Zheng,Xinyu Li,Qian Liu,Xiachong Feng,Lingpeng Kong
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:token sequences produced, Modern language models, Modern language, exclusively on token, external lossless compressor
备注:
点击查看摘要
Abstract:Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, one language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
69. 【2602.04288】Contextual Drag: How Errors in the Context Affect LLM Reasoning
链接:https://arxiv.org/abs/2602.04288
作者:Yun Cheng,Xingyu Zhu,Haoyu Zhao,Sanjeev Arora
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, past mistakes, self-improvement pipelines, pipelines for large, large language
备注:
点击查看摘要
Abstract:Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.
70. 【2602.04279】ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation
链接:https://arxiv.org/abs/2602.04279
作者:Jiarui Jin,Haoyu Wang,Xingliang Wu,Xiaocheng Fang,Xiang Lan,Zihan Wang,Deyun Zhang,Bo Liu,Yingying Zhang,Xian Wu,Hongyan Li,Shenda Hong
类目:Computation and Language (cs.CL)
关键词:large language models, clinically incorrect analyses, existing multimodal large, multimodal large language, indispensable diagnostic tool
备注:
点击查看摘要
Abstract:Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using \textit{Protocol-Guided Instruction Data Generation}, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with \textit{Interleaved Modality Dropout} to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present \textit{Reinforcement Learning with ECG Diagnostic Evidence Rewards} to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at \href{this https URL}{here}, and an online platform can be accessed at \href{this http URL}{here}.
71. 【2602.04254】Scaling Agentic Verifier for Competitive Coding
链接:https://arxiv.org/abs/2602.04254
作者:Zeyao Ma,Jing Zhang,Xiaokang Zhang,Jiaxi Yang,Zongmeng Zhang,Jiajun Zhang,Yuheng Jing,Lei Zhang,Hao Zheng,Wenting Zhao,Junyang Lin,Binyuan Hui
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, programming problems correctly, demonstrated strong coding, strong coding capabilities
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.
72. 【2602.04248】Empirical-MCTS: Continuous Agent Evolution via Dual-Experience Monte Carlo Tree Search
链接:https://arxiv.org/abs/2602.04248
作者:Hao Lu,Haoyuan Huang,Yulin Zhou,Chen Li,Ningxin Zhu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Monte Carlo Tree, Carlo Tree Search, Language Models, Monte Carlo
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Inference-time scaling strategies, particularly Monte Carlo Tree Search (MCTS), have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). However, current approaches remain predominantly stateless, discarding successful reasoning patterns after each problem instance and failing to mimic the empirical accumulation of wisdom characteristic of human problem-solving. To bridge this gap, we introduce Empirical-MCTS, a dual-loop framework that transforms stateless search into a continuous, non-parametric learning process. The framework unifies local exploration with global memory optimization through two novel mechanisms: Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) and a Memory Optimization Agent. PE-EMP functions as a reflexive optimizer within the local search, utilizing pairwise feedback to dynamically synthesize adaptive criteria and evolve meta-prompts (system prompts) in real-time. Simultaneously, the Memory Optimization Agent manages a global repository as a dynamic policy prior, employing atomic operations to distill high-quality insights across problems. Extensive evaluations on complex reasoning benchmarks, including AIME25, ARC-AGI-2, and MathArena Apex, demonstrate that Empirical-MCTS significantly outperforms both stateless MCTS strategies and standalone experience-driven agents. These results underscore the critical necessity of coupling structured search with empirical accumulation for mastering complex, open-ended reasoning tasks.
73. 【2602.04247】DementiaBank-Emotion: A Multi-Rater Emotion Annotation Corpus for Alzheimer's Disease Speech (Version 1.0)
链接:https://arxiv.org/abs/2602.04247
作者:Cheonkam Jeong,Jessica Liao,Audrey Lu,Yutong Song,Christopher Rashidian,Donna Krogh,Erik Krogh,Mahkameh Rasouli,Jung-Ah Lee,Nikil Dutt,Lisa M Gibbs,David Sultzer,Julie Rousseau,Jocelyn Ludlow,Margaret Galvez,Alexander Nuth,Chet Khay,Sabine Brunswicker,Adeline Nyamathi
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:Alzheimer disease, present DementiaBank-Emotion, Alzheimer, Delta, corpus for Alzheimer
备注: Accepted at HeaLING Workshop @ EACL 2026. 9 pages, 3 figures, 8 tables
点击查看摘要
Abstract:We present DementiaBank-Emotion, the first multi-rater emotion annotation corpus for Alzheimer's disease (AD) speech. Annotating 1,492 utterances from 108 speakers for Ekman's six basic emotions and neutral, we find that AD patients express significantly more non-neutral emotions (16.9%) than healthy controls (5.7%; p .001). Exploratory acoustic analysis suggests a possible dissociation: control speakers showed substantial F0 modulation for sadness (Delta = -3.45 semitones from baseline), whereas AD speakers showed minimal change (Delta = +0.11 semitones; interaction p = .023), though this finding is based on limited samples (sadness: n=5 control, n=15 AD) and requires replication. Within AD speech, loudness differentiates emotion categories, indicating partially preserved emotion-prosody mappings. We release the corpus, annotation guidelines, and calibration workshop materials to support research on emotion recognition in clinical populations.
74. 【2602.04246】CoLT: Reasoning with Chain of Latent Tool Calls
链接:https://arxiv.org/abs/2602.04246
作者:Fangwei Zhu,Zhifang Sui
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, token-level reasoning chain, inefficient token-level reasoning, Language Models
备注:
点击查看摘要
Abstract:Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as ``tool calls''. Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain information of a reasoning step. When a latent tool call is triggered, a smaller external model will take the hidden states of seed tokens as its input, and unpack the seed tokens back to a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
75. 【2602.04241】okenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation
链接:https://arxiv.org/abs/2602.04241
作者:Nuo Xu,Ahrii Kim
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, critically affects Natural, affects Natural Language, families remains under-explored, Byte Pair Encoding
备注:
点击查看摘要
Abstract:Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.
76. 【2602.04224】RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
链接:https://arxiv.org/abs/2602.04224
作者:Zeming Wei,Qiaosheng Zhang,Xia Hu,Xingcheng Xu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
关键词:Large Reasoning Models, basic language models, achieved tremendous success, language models, Large Reasoning
备注:
点击查看摘要
Abstract:Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs' safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at this https URL.
77. 【2602.04217】Frontend Token Enhancement for Token-Based Speech Recognition
链接:https://arxiv.org/abs/2602.04217
作者:Takanori Ashihara,Shota Horiguchi,Kohei Matsuura,Tsubasa Ochiai,Marc Delcroix
类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:Discretized representations, including automatic speech, automatic speech recognition, including automatic, signals are efficient
备注: Accepted at ICASSP 2026
点击查看摘要
Abstract:Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
78. 【2602.04212】Language Models Struggle to Use Representations Learned In-Context
链接:https://arxiv.org/abs/2602.04212
作者:Michael A. Lepori,Tal Linzen,Ann Yuan,Katja Filippova
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:enabled great success, artificial intelligence research, enabled great, great success, wide variety
备注:
点击查看摘要
Abstract:Though large language models (LLMs) have enabled great success across a wide variety of tasks, they still appear to fall short of one of the loftier goals of artificial intelligence research: creating an artificial system that can adapt its behavior to radically new contexts upon deployment. One important step towards this goal is to create systems that can induce rich representations of data that are seen in-context, and then flexibly deploy these representations to accomplish goals. Recently, Park et al. (2024) demonstrated that current LLMs are indeed capable of inducing such representation from context (i.e., in-context representation learning). The present study investigates whether LLMs can use these representations to complete simple downstream tasks. We first assess whether open-weights LLMs can use in-context representations for next-token prediction, and then probe models using a novel task, adaptive world modeling. In both tasks, we find evidence that open-weights LLMs struggle to deploy representations of novel semantics that are defined in-context, even if they encode these semantics in their latent representations. Furthermore, we assess closed-source, state-of-the-art reasoning models on the adaptive world modeling task, demonstrating that even the most performant LLMs cannot reliably leverage novel patterns presented in-context. Overall, this work seeks to inspire novel methods for encouraging models to not only encode information presented in-context, but to do so in a manner that supports flexible deployment of this information.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.04212 [cs.CL]
(or
arXiv:2602.04212v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.04212
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
79. 【2602.04206】Enforcing Monotonic Progress in Legal Cross-Examination: Preventing Long-Horizon Stagnation in LLM-Based Inquiry
链接:https://arxiv.org/abs/2602.04206
作者:Hsien-Jyh Liao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, exhibit impressive linguistic, impressive linguistic fluency, Large language, language models
备注: Submitted to ICAIL 2026. Under review
点击查看摘要
Abstract:Large language models (LLMs) exhibit impressive linguistic fluency but struggle to reliably complete long-horizon tasks under explicit procedural constraints. In legal cross-examination, purely proba-bilistic generation often maintains behavioral coherence while failing to ensure procedural advancement. We characterize this failure as procedural stagnation and propose Soft-FSM, a neuro-symbolic architecture that enforces monotonic progress over accumulated Key Information Units (KIUs) via an external deterministic state controller. Experiments on three real-world Taiwanese criminal homicide cases show that baseline methods collapse below 40% completeness, while Soft-FSM consistently achieves over 97% with near-zero redundancy. These results suggest that, in such domains, reliable task completion cannot be guaranteed by emergent LLM behavior alone, and can be reliably enforced through explicit and verifiable external state control.
80. 【2602.04197】From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents
链接:https://arxiv.org/abs/2602.04197
作者:Xinyue Wang,Yuanhe Zhang,Zhengshuo Gong,Haoran Gao,Fanyu Meng,Zhenhong Zhou,Li Sun,Yang Liu,Sen Su
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Toxic Proactivity, tool-use abilities, enhanced capabilities, capabilities of LLM-based, planning and tool-use
备注: 9 pages (excluding appendices), 6 figures. Code is available at [this https URL](https://github.com/wxyoio-0715/Toxic-Proactivity)
点击查看摘要
Abstract:The enhanced capabilities of LLM-based agents come with an emergency for model planning and tool-use abilities. Attributing to helpful-harmless trade-off from LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
81. 【2602.04196】he Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment
链接:https://arxiv.org/abs/2602.04196
作者:Zhexin Zhang,Yida Lu,Junfeng Fang,Junxiao Yang,Shiyao Cui,Hao Zhou,Fandong Meng,Jie Zhou,Hongning Wang,Minlie Huang,Tat-Seng Chua
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:elicit harmful outputs, deployment time, widely studied, studied at deployment, jailbreak attacks
备注:
点击查看摘要
Abstract:Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.
82. 【2602.04145】raining Data Efficiency in Multimodal Process Reward Models
链接:https://arxiv.org/abs/2602.04145
作者:Jinyuan Li,Chengsong Huang,Langlin Huang,Shaoyang Xu,Haolin Liu,Wenxuan Zhang,Jiaxin Huang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:Multimodal Process Reward, Process Reward Models, Multimodal Process, Reward Models, Process Reward
备注:
点击查看摘要
Abstract:Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM this http URL preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated this http URL explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
83. 【2602.04127】From Lemmas to Dependencies: What Signals Drive Light Verbs Classification?
链接:https://arxiv.org/abs/2602.04127
作者:Sercan Karakaş,Yusuf Şimşek
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Light verb constructions, verbal multiword expressions, productive complex predicates, complex predicates create, predicates create minimal
备注: EACL SIGTURK
点击查看摘要
Abstract:Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb--argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF--IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, Our findings motivate targeted evaluation of Turkish MWEs and show that ``lemma-only'' is not a single, well-defined representation, but one that depends critically on how normalization is operationalized.
84. 【2602.04112】DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling
链接:https://arxiv.org/abs/2602.04112
作者:Jiangnan Yang,Junjie Chen,Fei Wang,Yiqi Nie,Yuxin Liu,Zhangling Duan,Jie Chen
类目:Computation and Language (cs.CL)
关键词:clinicians integrate verbal, integrate verbal content, infer clients' mental, fundamentally multimodal cognitive, clients' mental states
备注:
点击查看摘要
Abstract:Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients' mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.
85. 【2602.04105】Expert Selections In MoE Models Reveal (Almost) As Much As Text
链接:https://arxiv.org/abs/2602.04105
作者:Amir Nuriyev,Gabriel Kulp
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:present a text-reconstruction, text-reconstruction attack, language models, expert selections, Abstract
备注:
点击查看摘要
Abstract:We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as sensitive as the underlying text.
86. 【2602.04099】Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMs
链接:https://arxiv.org/abs/2602.04099
作者:Letian Cheng,Junyan Wang,Yan Gao,Elliott Wen,Ting Dang,Hong Jia
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:widely adopted metric, large language models, widely adopted, quality of large, large language
备注:
点击查看摘要
Abstract:Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when irrelevant long inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs, evaluating representative LLMs under two scoring protocols (direct accumulation and fixed window sliding) across varying context lengths. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost, thereby linking predictive metrics to deployment realities. We further incorporate quantized variants not as a main contribution, but as robustness checks, showing that length-induced biases persist across both full-precision and compressed models. This design disentangles the effects of evaluation logic, quantization, and input length, and demonstrates that length bias is a general phenomenon that undermines fair cross-model comparison. Our analysis yields two key observations: (i) sliding window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realise gains as the evaluated segment length grows.
87. 【2602.04089】Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
链接:https://arxiv.org/abs/2602.04089
作者:Xiaofeng Lin,Sirou Zhu,Yilei Chen,Mingyu Chen,Hejian Sang,Ioannis Paschalidis,Zhipeng Wang,Aldo Pacchiano,Xuezhou Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:achieve strong performance, achieve strong, instruction-following problems, static prediction, prediction and instruction-following
备注:
点击查看摘要
Abstract:Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at this https URL.
88. 【2602.04085】BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning
链接:https://arxiv.org/abs/2602.04085
作者:Min Jang,Orevaoghene Ahia,Nazif Tamer,Sachin Kumar,Yulia Tsvetkov,Noah A. Smith
类目:ound (cs.SD); Computation and Language (cs.CL)
关键词:evaluate music understanding, Music understanding, semantic elements, requires reasoning, structural segmentation
备注:
点击查看摘要
Abstract:Music understanding is a complex task that often requires reasoning over both structural and semantic elements of audio. We introduce BASS, designed to evaluate music understanding and reasoning in audio language models across four broad categories: structural segmentation, lyric transcription, musicological analysis, and artist collaboration. BASS comprises 2658 questions spanning 12 tasks, 1993 unique songs and covering over 138 hours of music from a wide range of genres and tracks, crafted to assess musicological knowledge and reasoning in real-world scenarios. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks such as structural segmentation and artist collaboration, while performing best on lyric transcription. Our analysis reveals that current models leverage linguistic priors effectively but remain limited in reasoning over musical structure, vocal, and musicological attributes. BASS provides an evaluation framework with widespread applications in music recommendation and search and has the potential to guide the development of audio LMs.
89. 【2602.04081】Abstraction Induces the Brain Alignment of Language and Speech Models
链接:https://arxiv.org/abs/2602.04081
作者:Emily Cheng,Aditya R. Vaidya,Richard Antonello
类目:Computation and Language (cs.CL)
关键词:hidden states extracted, Research has repeatedly, intermediate hidden states, natural language stimuli, measured brain response
备注: under review
点击查看摘要
Abstract:Research has repeatedly demonstrated that intermediate hidden states extracted from large language models and speech audio models predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most effective for this unique and highly general transfer task? We give evidence that the correspondence between speech and language models and the brain derives from shared meaning abstraction and not their next-word prediction properties. In particular, models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwise intrinsic dimension, a measure of feature complexity. We show that a layer's intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; that the relation between intrinsic dimension and brain predictivity arises over model pre-training; and finetuning models to better predict the brain causally increases both representations' intrinsic dimension and their semantic content. Results suggest that semantic richness, high intrinsic dimension, and brain predictivity mirror each other, and that the key driver of model-brain similarity is rich meaning abstraction of the inputs, where language modeling is a task sufficiently complex (but perhaps not the only) to require it.
90. 【2602.04074】Stroke Lesions as a Rosetta Stone for Language Model Interpretability
链接:https://arxiv.org/abs/2602.04074
作者:Julius Fridriksson(1,2),Roger D. Newman-Norlund(1,2),Saeed Ahmadi(1),Regan Willis(3),Nadra Salman(4),Kalil Warren(4),Xiang Guan(3),Yong Yang(3),Srihari Nelakuditi(3),Rutvik Desai(5),Leonardo Bonilha(6),Jeff Charney(2,7),Chris Rorden(5) ((1) University of South Carolina, (2) a href="http://ALLT.AI" rel="external noopener nofollow" class="link-external link-http"this http URL/a, LLC, (3) University of South Carolina, Department of Computer Science and Engineering, (4) University of South Carolina, Linguistics Program, (5) Department of Psychology, University of South Carolina, (6) Department of Neurology, USC School of Medicine, (7) MKHSTRY, LLC)
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:achieved remarkable capabilities, function remain limited, Large language models, language function remain, Large language
备注: 45 pages, 17 figures
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable capabilities, yet methods to verify which model components are truly necessary for language function remain limited. Current interpretability approaches rely on internal metrics and lack external validation. Here we present the Brain-LLM Unified Model (BLUM), a framework that leverages lesion-symptom mapping, the gold standard for establishing causal brain-behavior relationships for over a century, as an external reference structure for evaluating LLM perturbation effects. Using data from individuals with chronic post-stroke aphasia (N = 410), we trained symptom-to-lesion models that predict brain damage location from behavioral error profiles, applied systematic perturbations to transformer layers, administered identical clinical assessments to perturbed LLMs and human patients, and projected LLM error profiles into human lesion space. LLM error profiles were sufficiently similar to human error profiles that predicted lesions corresponded to actual lesions in error-matched humans above chance in 67% of picture naming conditions (p 10^{-23}) and 68.3% of sentence completion conditions (p 10^{-61}), with semantic-dominant errors mapping onto ventral-stream lesion patterns and phonemic-dominant errors onto dorsal-stream patterns. These findings open a new methodological avenue for LLM interpretability in which clinical neuroscience provides external validation, establishing human lesion-symptom mapping as a reference framework for evaluating artificial language systems and motivating direct investigation of whether behavioral alignment reflects shared computational principles.
91. 【2602.04033】On the Credibility of Evaluating LLMs using Survey Questions
链接:https://arxiv.org/abs/2602.04033
作者:Jindřich Libovický
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:Recent studies evaluate, Recent studies, large language models, adapted social surveys, studies evaluate
备注: Accepted to the Workshop on Multilingual and Multicultural Evaluation at EACL 2026, 12 pages, 2 figures
点击查看摘要
Abstract:Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human data, when considering LLM responses independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
92. 【2602.04017】Chaplains' Reflections on the Design and Usage of AI for Conversational Care
链接:https://arxiv.org/abs/2602.04017
作者:Joel Wester,Samuel Rhys Cox,Henning Pohl,Niels van Berkel
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:requires domain knowledge, domain knowledge, current work, diagnosis and intervention, growing recognition
备注: To appear at ACM CHI 2026. 15 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Despite growing recognition that responsible AI requires domain knowledge, current work on conversational AI primarily draws on clinical expertise that prioritises diagnosis and intervention. However, much of everyday emotional support needs occur in non-clinical contexts, and therefore requires different conversational approaches. We examine how chaplains, who guide individuals through personal crises, grief, and reflection, perceive and engage with conversational AI. We recruited eighteen chaplains to build AI chatbots. While some chaplains viewed chatbots with cautious optimism, the majority expressed limitations of chatbots' ability to support everyday well-being. Our analysis reveals how chaplains perceive their pastoral care duties and areas where AI chatbots fall short, along the themes of Listening, Connecting, Carrying, and Wanting. These themes resonate with the idea of attunement, recently highlighted as a relational lens for understanding the delicate experiences care technologies provide. This perspective informs chatbot design aimed at supporting well-being in non-clinical contexts.
93. 【2602.03980】ransformers perform adaptive partial pooling
链接:https://arxiv.org/abs/2602.03980
作者:Vsevolod Kapatsinski
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:reasonable language model, reasonable language, context, language model, language is creative
备注: 6 pages, submitted to the annual meeting of the Cognitive Science Society
点击查看摘要
Abstract:Because language is creative, any reasonable language model must generalize, deciding what to say in novel contexts by using information from similar contexts. But what about contexts that are not novel but merely infrequent? In hierarchical regression, the model's predictions for behavior in a context are affected by observations from other similar contexts to the extent that 1) the current context is infrequent and 2) different contexts behave similarly. This is called adaptive partial pooling of evidence. This paper shows that next-word predictions of a transformer (GPT2) are increasingly unaffected by observations from outside the current context across epochs of training (the amount of pooling reduces with training), and that the extent of pooling is affected by context frequency, context number (type frequency) and context variability in a similar way to hierarchical regression. These characteristics of learning in transformers are argued to be realistic on both rational and empirical grounds.
94. 【2602.03979】Likelihood-Based Reward Designs for General LLM Reasoning
链接:https://arxiv.org/abs/2602.03979
作者:Ariel Kwiatkowski,Natasha Butt,Ismail Labiad,Julia Kempe,Yann Ollivier
类目:Computation and Language (cs.CL)
关键词:large language models, Fine-tuning large language, reinforcement learning requires, specific reward function, language models
备注:
点击查看摘要
Abstract:Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
95. 【2602.03962】Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines
链接:https://arxiv.org/abs/2602.03962
作者:Erik Saule,Kalpathi Subramanian,Razvan Bunescu
类目:Computation and Language (cs.CL)
关键词:Computer Science program, Professional societies, publish curriculum guidelines, Computer Science, societies often publish
备注:
点击查看摘要
Abstract:Professional societies often publish curriculum guidelines to help programs align their content to international standards. In Computer Science, the primary standard is published by ACM and IEEE and provide detailed guidelines for what should be and could be included in a Computer Science program. While very helpful, it remains difficult for program administrators to assess how much of the guidelines is being covered by a CS program. This is in particular due to the extensiveness of the guidelines, containing thousands of individual items. As such, it is time consuming and cognitively demanding to audit every course to confidently mark everything that is actually being covered. Our preliminary work indicated that it takes about a day of work per course. In this work, we propose using Natural Language Processing techniques to accelerate the process. We explore two kinds of techniques, the first relying on traditional tools for parsing, tagging, and embeddings, while the second leverages the power of Large Language Models. We evaluate the application of these techniques to classify a corpus of pedagogical materials and show that we can meaningfully classify documents automatically.
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.03962 [cs.CL]
(or
arXiv:2602.03962v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.03962
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
96. 【2602.03942】Linguistic Blind Spots in Clinical Decision Extraction
链接:https://arxiv.org/abs/2602.03942
作者:Mohamed Elgaar,Hadi Amiri
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Extracting medical decisions, Extracting medical, patient-facing care summaries, clinical decision support, key step
备注: EACL HeaLing Workshop 2026
点击查看摘要
Abstract:Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions contain more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% to 24% from the lowest to highest stopword-proportion bins, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans--common in advice and precaution decisions--are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.
97. 【2602.03916】SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
链接:https://arxiv.org/abs/2602.03916
作者:Azmine Toushik Wasi,Wahid Faisal,Abdur Rahman,Mahfuz Ahmed Anik,Munem Shahriar,Mohsin Mahmud Topu,Sadia Tasnim Meem,Rahatun Nesa Priti,Sabrina Afroz Mitu,Md. Iqramul Hoque,Shahriyar Zaman Ridoy,Mohammed Eunus Ali,Majd Hawasly,Mohammad Raza,Md Rizwan Parvez
类目:Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Spatial reasoning, contemporary vision-language models, Spatial, fundamental aspect, contemporary vision-language
备注: Accepted to ICLR 2026. 92 Pages. 42 Figures and 29 Tables
点击查看摘要
Abstract:Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth Occlusion, Orientation, Size Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: this https URL.
98. 【2602.03849】HybridQuestion: Human-AI Collaboration for Identifying High-Impact Research Questions
链接:https://arxiv.org/abs/2602.03849
作者:Keyu Zhao,Fengli Xu,Yong Li,Tie-Yan Liu
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:automating key stages, transforming scientific research, paradigm is transforming, scholarly writing, idea generation
备注: 16 pages, 6 figures, 4 tables
点击查看摘要
Abstract:The "AI Scientist" paradigm is transforming scientific research by automating key stages of the research process, from idea generation to scholarly writing. This shift is expected to accelerate discovery and expand the scope of scientific inquiry. However, a key question remains unclear: can AI scientists identify meaningful research questions? While Large Language Models (LLMs) have been applied successfully to task-specific ideation, their potential to conduct strategic, long-term assessments of past breakthroughs and future questions remains largely unexplored. To address this gap, we explore a human-AI hybrid solution that integrates the scalable data processing capabilities of AI with the value judgment of human experts. Our methodology is structured in three phases. The first phase, AI-Accelerated Information Gathering, leverages AI's advantage in processing vast amounts of literature to generate a hybrid information base. The second phase, Candidate Question Proposing, utilizes this synthesized data to prompt an ensemble of six diverse LLMs to propose an initial candidate pool, filtered via a cross-model voting mechanism. The third phase, Hybrid Question Selection, refines this pool through a multi-stage filtering process that progressively increases human oversight. To validate this system, we conducted an experiment aiming to identify the Top 10 Scientific Breakthroughs of 2025 and the Top 10 Scientific Questions for 2026 across five major disciplines. Our analysis reveals that while AI agents demonstrate high alignment with human experts in recognizing established breakthroughs, they exhibit greater divergence in forecasting prospective questions, suggesting that human judgment remains crucial for evaluating subjective, forward-looking challenges.
99. 【2601.19410】Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?
链接:https://arxiv.org/abs/2601.19410
作者:Ahrii Kim,Seong-heum Kim
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:refine machine translations, correcting residual errors, aims to refine, APE, refine machine
备注:
点击查看摘要
Abstract:Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE--especially under document-level context--remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement.
Subjects:
Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Cite as:
arXiv:2601.19410 [cs.CL]
(or
arXiv:2601.19410v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2601.19410
Focus to learn more
arXiv-issued DOI via DataCite
Related DOI:
https://doi.org/10.36227/techrxiv.176107895.57699371/v1
Focus to learn more
DOI(s) linking to related resources</p>
100. 【2405.18605】Merged ChemProt-DrugProt for Relation Extraction from Biomedical Literature
链接:https://arxiv.org/abs/2405.18605
作者:Mai H. Nguyen,Shibani Likhite,Jiawei Tang,Darshini Mahendran,Bridget T. McInnes
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Molecular Networks (q-bio.MN)
关键词:chemical-gene relations plays, Graph Convolutional Networks, Bidirectional Encoder Representations, compounds and genes, drug discovery
备注:
点击查看摘要
Abstract:The extraction of chemical-gene relations plays a pivotal role in understanding the intricate interactions between chemical compounds and genes, with significant implications for drug discovery, disease understanding, and biomedical research. This paper presents a data set created by merging the ChemProt and DrugProt datasets to augment sample counts and improve model accuracy. We evaluate the merged dataset using two state of the art relationship extraction algorithms: Bidirectional Encoder Representations from Transformers (BERT) specifically BioBERT, and Graph Convolutional Networks (GCNs) combined with BioBERT. While BioBERT excels at capturing local contexts, it may benefit from incorporating global information essential for understanding chemical-gene interactions. This can be achieved by integrating GCNs with BioBERT to harness both global and local context. Our results show that by integrating the ChemProt and DrugProt datasets, we demonstrated significant improvements in model performance, particularly in CPR groups shared between the datasets. Incorporating the global context using GCN can help increase the overall precision and recall in some of the CPR groups over using just BioBERT.
101. 【2602.04307】Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement
链接:https://arxiv.org/abs/2602.04307
作者:Chien-Chun Wang,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
关键词:exhibited remarkable capabilities, automatic speech recognition, exhibited remarkable, remarkable capabilities, capabilities under matched
备注: Accepted to IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP)
点击查看摘要
Abstract:Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we in this paper present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
102. 【2602.03868】Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts
链接:https://arxiv.org/abs/2602.03868
作者:Chandrashekar M S,Vineet Singh,Lakshmi Pedapudi
类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
关键词:Automatic Speech Recognition, robust Automatic Speech, India requires robust, requires robust Automatic, Speech Recognition
备注: 9 pages, 6 figures
点击查看摘要
Abstract:The digitization of agricultural advisory services in India requires robust Automatic Speech Recognition (ASR) systems capable of accurately transcribing domain-specific terminology in multiple Indian languages. This paper presents a benchmarking framework for evaluating ASR performance in agricultural contexts across Hindi, Telugu, and Odia languages. We introduce evaluation metrics including Agriculture Weighted Word Error Rate (AWWER) and domain-specific utility scoring to complement traditional metrics. Our evaluation of 10,934 audio recordings, each transcribed by up to 10 ASR models, reveals performance variations across languages and models, with Hindi achieving the best overall performance (WER: 16.2%) while Odia presents the greatest challenges (best WER: 35.1%, achieved only with speaker diarization). We characterize audio quality challenges inherent to real-world agricultural field recordings and demonstrate that speaker diarization with best-speaker selection can substantially reduce WER for multi-speaker recordings (upto 66% depending on the proportion of multi-speaker audio). We identify recurring error patterns in agricultural terminology and provide practical recommendations for improving ASR systems in low-resource agricultural domains. The study establishes baseline benchmarks for future agricultural ASR development.
信息检索
1. 【2602.04812】Robust Generalizable Heterogeneous Legal Link Prediction
链接:https://arxiv.org/abs/2602.04812
作者:Lorenz Wendlinger,Simon Alexander Nonn,Abdullah Al Zubaer,Michael Granitzer
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:Recent work, legal citation networks, applied link prediction, large heterogeneous legal, heterogeneous legal citation
备注: 9 Pages
点击查看摘要
Abstract:Recent work has applied link prediction to large heterogeneous legal citation networks \new{with rich meta-features}. We find that this approach can be improved by including edge dropout and feature concatenation for the learning of more robust representations, which reduces error rates by up to 45%. We also propose an approach based on multilingual node features with an improved asymmetric decoder for compatibility, which allows us to generalize and extend the prediction to more, geographically and linguistically disjoint, data from New Zealand. Our adaptations also improve inductive transferability between these disjoint legal systems.
2. 【2602.04735】From Data to Behavior: Predicting Unintended Model Behaviors Before Training
链接:https://arxiv.org/abs/2602.04735
作者:Mengru Wang,Zhenqian Xu,Junfeng Fang,Yunzhi Yao,Shumin Deng,Huajun Chen,Ningyu Zhang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, seemingly benign training, malicious content, Language Models
备注: Work in progress
点击查看摘要
Abstract:Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
3. 【2602.04711】Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention
链接:https://arxiv.org/abs/2602.04711
作者:Sagie Dekel,Moshe Tennenholtz,Oren Kurland
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Retrieval Augmented Generation, Retrieval Augmented, Augmented Generation, highly effective paradigm, keeping LLM-based responses
备注:
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) is a highly effective paradigm for keeping LLM-based responses up-to-date and reducing the likelihood of hallucinations. Yet, RAG was recently shown to be quite vulnerable to corpus knowledge poisoning: an attacker injects misleading documents to the corpus to steer an LLM's output to an undesired response. We argue that the standard causal attention mechanism in LLMs enables harmful cross-document interactions, specifically in cases of attacks. Accordingly, we introduce a novel defense approach for RAG: Sparse Document Attention RAG (SDAG). This is a block-sparse attention mechanism that disallows cross-attention between retrieved documents. SDAG requires a minimal inference-time change to the attention mask; furthermore, no fine-tuning or additional architectural changes are needed. We present an empirical evaluation of LLM-based question answering (QA) with a variety of attack strategies on RAG. We show that our SDAG method substantially outperforms the standard causal attention mechanism in terms of attack success rate. We further demonstrate the clear merits of integrating SDAG with state-of-the-art RAG defense methods. Specifically, the integration results in performance that is statistically significantly better than the state-of-the-art.
4. 【2602.04690】Multi-Source Retrieval and Reasoning for Legal Sentencing Prediction
链接:https://arxiv.org/abs/2602.04690
作者:Junjie Chen,Haitao Li,Qilei Zhang,Zhenghua Li,Ya Zhang,Quan Zhou,Cheng Luo,Yiqun Liu,Dongsheng Guo,Qingyao Ai
类目:Information Retrieval (cs.IR)
关键词:includes law article, predict judicial outcomes, typically includes law, Legal judgment prediction, legal sentencing prediction
备注:
点击查看摘要
Abstract:Legal judgment prediction (LJP) aims to predict judicial outcomes from case facts and typically includes law article, charge, and sentencing prediction. While recent methods perform well on the first two subtasks, legal sentencing prediction (LSP) remains difficult due to its need for fine-grained objective knowledge and flexible subjective reasoning. To address these limitations, we propose $MSR^2$, a framework that integrates multi-source retrieval and reasoning in LLMs with reinforcement learning. $MSR^2$ enables LLMs to perform multi-source retrieval based on reasoning needs and applies a process-level reward to guide intermediate subjective reasoning steps. Experiments on two real-world datasets show that $MSR^2$ improves both accuracy and interpretability in LSP, providing a promising step toward practical legal AI. Our code is available at this https URL.
5. 【2602.04579】AIANO: Enhancing Information Retrieval with AI-Augmented Annotation
链接:https://arxiv.org/abs/2602.04579
作者:Sameh Khattab,Marie Bauer,Lukas Heine,Till Rostalski,Jens Kleesiek,Julian Friedrich
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, Retrieval-Augmented Generation, rise of Large
备注:
点击查看摘要
Abstract:The rise of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has rapidly increased the need for high-quality, curated information retrieval datasets. These datasets, however, are currently created with off-the-shelf annotation tools that make the annotation process complex and inefficient. To streamline this process, we developed a specialized annotation tool - AIANO. By adopting an AI-augmented annotation workflow that tightly integrates human expertise with LLM assistance, AIANO enables annotators to leverage AI suggestions while retaining full control over annotation decisions. In a within-subject user study ($n = 15$), participants created question-answering datasets using both a baseline tool and AIANO. AIANO nearly doubled annotation speed compared to the baseline while being easier to use and improving retrieval accuracy. These results demonstrate that AIANO's AI-augmented approach accelerates and enhances dataset creation for information retrieval tasks, advancing annotation capabilities in retrieval-intensive domains.
6. 【2602.04567】VK-LSVD: A Large-Scale Industrial Dataset for Short-Video Recommendation
链接:https://arxiv.org/abs/2602.04567
作者:Aleksandr Poslavsky,Alexander D'yakonov,Yuriy Dorn,Andrey Zimovnov
类目:Information Retrieval (cs.IR); Computers and Society (cs.CY)
关键词:real-world platform dynamics, reflect real-world platform, presents unique challenges, modeling rapid user, rapid user interest
备注: Accepted to The ACM Web Conference 2026 (WWW '26). Preprint of conference paper. 7 pages, 2 (7) figures, 4 tables. Dataset available at: [this https URL](https://huggingface.co/datasets/deepvk/VK-LSVD)
点击查看摘要
Abstract:Short-video recommendation presents unique challenges, such as modeling rapid user interest shifts from implicit feedback, but progress is constrained by a lack of large-scale open datasets that reflect real-world platform dynamics. To bridge this gap, we introduce the VK Large Short-Video Dataset (VK-LSVD), the largest publicly available industrial dataset of its kind. VK-LSVD offers an unprecedented scale of over 40 billion interactions from 10 million users and almost 20 million videos over six months, alongside rich features including content embeddings, diverse feedback signals, and contextual metadata. Our analysis supports the dataset's quality and diversity. The dataset's immediate impact is confirmed by its central role in the live VK RecSys Challenge 2025. VK-LSVD provides a vital, open dataset to use in building realistic benchmarks to accelerate research in sequential recommendation, cold-start scenarios, and next-generation recommender systems.
7. 【2602.04546】Unmasking Superspreaders: Data-Driven Approaches for Identifying and Comparing Key Influencers of Conspiracy Theories on X.com
链接:https://arxiv.org/abs/2602.04546
作者:Florian Kramer,Henrich R. Greve,Moritz von Zahn,Hayagreeva Rao
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:Human Superspreaders, deepening polarization, spreading misinformation, democratic institutions, Superspreaders
备注:
点击查看摘要
Abstract:Conspiracy theories can threaten society by spreading misinformation, deepening polarization, and eroding trust in democratic institutions. Social media often fuels the spread of conspiracies, primarily driven by two key actors: Superspreaders -- influential individuals disseminating conspiracy content at disproportionately high rates, and Bots -- automated accounts designed to amplify conspiracies strategically. To counter the spread of conspiracy theories, it is critical to both identify these actors and to better understand their behavior. However, a systematic analysis of these actors as well as real-world-applicable identification methods are still lacking. In this study, we leverage over seven million tweets from the COVID-19 pandemic to analyze key differences between Human Superspreaders and Bots across dimensions such as linguistic complexity, toxicity, and hashtag usage. Our analysis reveals distinct communication strategies: Superspreaders tend to use more complex language and substantive content while relying less on structural elements like hashtags and emojis, likely to enhance credibility and authority. By contrast, Bots favor simpler language and strategic cross-usage of hashtags, likely to increase accessibility, facilitate infiltration into trending discussions, and amplify reach. To counter both Human Superspreaders and Bots, we propose and evaluate 27 novel metrics for quantifying the severity of conspiracy theory spread. Our findings highlight the effectiveness of an adapted H-Index for computationally feasible identification of Human Superspreaders. By identifying behavioral patterns unique to Human Superspreaders and Bots as well as providing suitable identification methods, this study provides a foundation for mitigation strategies, including platform moderation policies, temporary and permanent account suspensions, and public awareness campaigns.
8. 【2602.04460】DOS: Dual-Flow Orthogonal Semantic IDs for Recommendation in Meituan
链接:https://arxiv.org/abs/2602.04460
作者:Junwei Yin,Senjie Kou,Changhao Li,Shuli Wang,Xue Wei,Yinqiu Huang,Yinhua Zhu,Haitao Wang,Xingxing Wang
类目:Information Retrieval (cs.IR)
关键词:generative recommendation systems, Semantic IDs serve, Semantic, key component, component in generative
备注: Accepted by WWW2026 (short paper)
点击查看摘要
Abstract:Semantic IDs serve as a key component in generative recommendation systems. They not only incorporate open-world knowledge from large language models (LLMs) but also compress the semantic space to reduce generation difficulty. However, existing methods suffer from two major limitations: (1) the lack of contextual awareness in generation tasks leads to a gap between the Semantic ID codebook space and the generation space, resulting in suboptimal recommendations; and (2) suboptimal quantization methods exacerbate semantic loss in LLMs. To address these issues, we propose Dual-Flow Orthogonal Semantic IDs (DOS) method. Specifically, DOS employs a user-item dual flow-framework that leverages collaborative signals to align the Semantic ID codebook space with the generation space. Furthermore, we introduce an orthogonal residual quantization scheme that rotates the semantic space to an appropriate orientation, thereby maximizing semantic preservation. Extensive offline experiments and online A/B testing demonstrate the effectiveness of DOS. The proposed method has been successfully deployed in Meituan's mobile application, serving hundreds of millions of users.
9. 【2602.04451】SDR-CIR: Semantic Debias Retrieval Framework for Training-Free Zero-Shot Composed Image Retrieval
链接:https://arxiv.org/abs/2602.04451
作者:Yi Sun,Jinyu Xu,Qing Xie,Jiachen Li,Yanchun Ma,Yongjian Liu
类目:Information Retrieval (cs.IR)
关键词:Composed Image Retrieval, Multimodal Large Language, Semantic Debias Ranking, query composed, Large Language Models
备注:
点击查看摘要
Abstract:Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and modification text. Recent training-free zero-shot methods often employ Multimodal Large Language Models (MLLMs) with Chain-of-Thought (CoT) to compose a target image description for retrieval. However, due to the fuzzy matching nature of ZS-CIR, the generated description is prone to semantic bias relative to the target image. We propose SDR-CIR, a training-free Semantic Debias Ranking method based on CoT reasoning. First, Selective CoT guides the MLLM to extract visual content relevant to the modification text during image understanding, thereby reducing visual noise at the source. We then introduce a Semantic Debias Ranking with two steps, Anchor and Debias, to mitigate semantic bias. In the Anchor step, we fuse reference image features with target description features to reinforce useful semantics and supplement omitted cues. In the Debias step, we explicitly model the visual semantic contribution of the reference image to the description and incorporate it into the similarity score as a penalty term. By supplementing omitted cues while suppressing redundancy, SDR-CIR mitigates semantic bias and improves retrieval performance. Experiments on three standard CIR benchmarks show that SDR-CIR achieves state-of-the-art results among one-stage methods while maintaining high efficiency. The code is publicly available at this https URL.
10. 【2602.04278】MiniRec: Data-Efficient Reinforcement Learning for LLM-based Recommendation
链接:https://arxiv.org/abs/2602.04278
作者:Lin Wang,Yang Zhang,Jingfan Chen,Xiaoyan Zhao,Fengbin Zhu,Qing Li,Tat-Seng Chua
类目:Information Retrieval (cs.IR)
关键词:user preference modeling, improving user preference, RL-based LLM recommendation, RL-based LLM, LLM recommendation
备注:
点击查看摘要
Abstract:The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based LLM recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss- or gradient-driven or dataset coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based LLM recommendation. MiniRec evaluates sample learnability using key RL signals -- rewards -- pruning samples that are too easy (too high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with the approximated "ideal" global RL optimization trajectory, selecting samples that mainly drive model updates, and it also enforces diversity to reduce redundancy. Combined with a curriculum learning strategy from easy to hard samples, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec's effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based LLM recommendation.
11. 【2602.04263】LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval
链接:https://arxiv.org/abs/2602.04263
作者:Joohyung Yun,Doyup Lee,Wook-Shin Han
类目:Information Retrieval (cs.IR)
关键词:retrieve query-relevant components, composed of textual, visual elements, aims to retrieve, retrieve query-relevant
备注: Project page: [this https URL](https://lilac-emnlp2025.github.io/)
点击查看摘要
Abstract:Multimodal document retrieval aims to retrieve query-relevant components from documents composed of textual, tabular, and visual elements. An effective multimodal retriever needs to handle two main challenges: (1) mitigate the effect of irrelevant contents caused by fixed, single-granular retrieval units, and (2) support multihop reasoning by effectively capturing semantic relationships among components within and across documents. To address these challenges, we propose LILaC, a multimodal retrieval framework featuring two core innovations. First, we introduce a layered component graph, explicitly representing multimodal information at two layers - each representing coarse and fine granularity - facilitating efficient yet precise reasoning. Second, we develop a late-interaction-based subgraph retrieval method, an edge-based approach that initially identifies coarse-grained nodes for efficient candidate generation, then performs fine-grained reasoning via late interaction. Extensive experiments demonstrate that LILaC achieves state-of-the-art retrieval performance on all five benchmarks, notably without additional fine-tuning. We make the artifacts publicly available at this http URL.
12. 【2602.04225】Following the TRAIL: Predicting and Explaining Tomorrow's Hits with a Fine-Tuned LLM
链接:https://arxiv.org/abs/2602.04225
作者:Yinan Zhang,Zhixi Chen,Jiazheng Jing,Zhiqi Shen
类目:Information Retrieval (cs.IR)
关键词:Large Language Models, Language Models, Large Language, strong reasoning capabilities, reasoning capabilities
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have been widely applied across multiple domains for their broad knowledge and strong reasoning capabilities. However, applying them to recommendation systems is challenging since it is hard for LLMs to extract user preferences from large, sparse user-item logs, and real-time per-user ranking over the full catalog is too time-consuming to be practical. Moreover, many existing recommender systems focus solely on ranking items while overlooking explanations, which could help improve predictive accuracy and make recommendations more convincing to users. Inspired by recent works that achieve strong recommendation performance by forecasting near-term item popularity, we propose TRAIL (TRend and explAnation Integrated Learner). TRAIL is a fine-tuned LLM that jointly predicts short-term item popularity and generates faithful natural-language explanations. It employs contrastive learning with positive and negative pairs to align its scores and explanations with structured trend signals, yielding accurate and explainable popularity predictions. Extensive experiments show that TRAIL outperforms strong baselines and produces coherent, well-grounded explanations.
13. 【2602.04174】GenMRP: A Generative Multi-Route Planning Framework for Efficient and Personalized Real-Time Industrial Navigation
链接:https://arxiv.org/abs/2602.04174
作者:Chengzhang Wang,Chao Chen,Jun Tao,Tengfei Liu,He Bai,Song Wang,Longfei Xu,Kaikui Liu,Xiangxiang Chu
类目:Robotics (cs.RO); Graphics (cs.GR); Information Retrieval (cs.IR)
关键词:Existing industrial-scale navigation, Existing industrial-scale, industrial-scale navigation applications, navigation applications contend, massive road networks
备注:
点击查看摘要
Abstract:Existing industrial-scale navigation applications contend with massive road networks, typically employing two main categories of approaches for route planning. The first relies on precomputed road costs for optimal routing and heuristic algorithms for generating alternatives, while the second, generative methods, has recently gained significant attention. However, the former struggles with personalization and route diversity, while the latter fails to meet the efficiency requirements of large-scale real-time scenarios. To address these limitations, we propose GenMRP, a generative framework for multi-route planning. To ensure generation efficiency, GenMRP first introduces a skeleton-to-capillary approach that dynamically constructs a relevant sub-network significantly smaller than the full road network. Within this sub-network, routes are generated iteratively. The first iteration identifies the optimal route, while the subsequent ones generate alternatives that balance quality and diversity using the newly proposed correctional boosting approach. Each iteration incorporates road features, user historical sequences, and previously generated routes into a Link Cost Model to update road costs, followed by route generation using the Dijkstra algorithm. Extensive experiments show that GenMRP achieves state-of-the-art performance with high efficiency in both offline and online environments. To facilitate further research, we have publicly released the training and evaluation dataset. GenMRP has been fully deployed in a real-world navigation app, demonstrating its effectiveness and benefits.
14. 【2602.03992】Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval
链接:https://arxiv.org/abs/2602.03992
作者:Gabriel de Souza P. Moreira,Ronay Ak,Mengyao Xu,Oliver Holworthy,Benedikt Schifferer,Zhiding Yu,Yauhen Babakhin,Radek Osmulski,Jiarui Cai,Ryan Chesler,Bo Liu,Even Oldridge
类目:Information Retrieval (cs.IR)
关键词:injecting external knowledge, Retrieval-Augmented Generation, powering language models, generative applications, powering language
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems have been popular for generative applications, powering language models by injecting external knowledge. Companies have been trying to leverage their large catalog of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models are used to generate a dense representation of the user query that is closer to relevant content embeddings. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss compute and storage engineering challenges posed by the late interaction mechanism and present experiments on how to balance accuracy and storage with lower dimension embeddings.
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2602.03992 [cs.IR]
(or
arXiv:2602.03992v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.03992
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
15. 【2405.18605】Merged ChemProt-DrugProt for Relation Extraction from Biomedical Literature
链接:https://arxiv.org/abs/2405.18605
作者:Mai H. Nguyen,Shibani Likhite,Jiawei Tang,Darshini Mahendran,Bridget T. McInnes
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Molecular Networks (q-bio.MN)
关键词:chemical-gene relations plays, Graph Convolutional Networks, Bidirectional Encoder Representations, compounds and genes, drug discovery
备注:
点击查看摘要
Abstract:The extraction of chemical-gene relations plays a pivotal role in understanding the intricate interactions between chemical compounds and genes, with significant implications for drug discovery, disease understanding, and biomedical research. This paper presents a data set created by merging the ChemProt and DrugProt datasets to augment sample counts and improve model accuracy. We evaluate the merged dataset using two state of the art relationship extraction algorithms: Bidirectional Encoder Representations from Transformers (BERT) specifically BioBERT, and Graph Convolutional Networks (GCNs) combined with BioBERT. While BioBERT excels at capturing local contexts, it may benefit from incorporating global information essential for understanding chemical-gene interactions. This can be achieved by integrating GCNs with BioBERT to harness both global and local context. Our results show that by integrating the ChemProt and DrugProt datasets, we demonstrated significant improvements in model performance, particularly in CPR groups shared between the datasets. Incorporating the global context using GCN can help increase the overall precision and recall in some of the CPR groups over using just BioBERT.
计算机视觉
1. 【2602.04884】Reinforced Attention Learning
链接:https://arxiv.org/abs/2602.04884
作者:Bangzheng Li,Jianmo Ni,Chen Qu,Ian Miao,Liu Yang,Xingyu Fu,Muhao Chen,Derek Zhiyuan Cheng
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Large Language Models, reasoning in Large, Reinforcement Learning, substantially improved reasoning, Language Models
备注:
点击查看摘要
Abstract:Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
Subjects:
Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2602.04884 [cs.CL]
(or
arXiv:2602.04884v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.04884
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
2. 【2602.04877】CoWTracker: Tracking by Warping instead of Correlation
链接:https://arxiv.org/abs/2602.04877
作者:Zihang Lai,Eldar Insafutdinov,Edgar Sucar,Andrea Vedaldi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision, robotic manipulation, Dense point, fundamental problem, problem in computer
备注: Project website: [this http URL](http://cowtracker.github.io)
点击查看摘要
Abstract:Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose \method, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
3. 【2602.04876】PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation
链接:https://arxiv.org/abs/2602.04876
作者:Jiahao Zhan,Zizhang Li,Hong-Xing Yu,Jiajun Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:hybrid generative simulator, scene generation, simulator that enables, hybrid generative, generative simulator
备注: Project website: [this https URL](https://johnzhan2023.github.io/PerpetualWonder/)
点击查看摘要
Abstract:We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.
4. 【2602.04873】Laminating Representation Autoencoders for Efficient Diffusion
链接:https://arxiv.org/abs/2602.04873
作者:Ramón Calvo-González,François Fleuret
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generate high-quality images, directly on SSL, SSL patch features, SSL patch, Recent work
备注:
点击查看摘要
Abstract:Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens -an 8x reduction in sequence length and 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.
5. 【2602.04864】When LLaVA Meets Objects: Token Composition for Vision-Language-Models
链接:https://arxiv.org/abs/2602.04864
作者:Soumya Jahagirdar,Walid Bousselham,Anna Kukleva,Hilde Kuehne
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autoregressive Vision Language, Vision Language Models, Vision Language, Current autoregressive Vision, autoregressive Vision
备注:
点击查看摘要
Abstract:Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Namely, we combine mask-based object representations together with global tokens and local patch tokens. While all tokens are used during training, it shows that the resulting model can flexibly drop especially the number of mask-based object-tokens at test time, allowing to adapt the number of tokens during inference without the need to retrain the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks showing results competitive to current token efficient methods and comparable to the original LLaVA baseline using only a fraction of visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for good performance.
6. 【2602.04851】PDF-HR: Pose Distance Fields for Humanoid Robots
链接:https://arxiv.org/abs/2602.04851
作者:Yi Gu,Yukang Gao,Yangchen Zhou,Xingyu Chen,Yixiao Feng,Mingle Zhao,Yunyang Mo,Zhaorui Wang,Lixin Xu,Renjing Xu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:motion priors play, play a crucial, crucial role, Pose Distance Fields, motion
备注: \href{ [this https URL](https://gaoyukang33.github.io/PDF-HR/) }{Project page}
点击查看摘要
Abstract:Pose and motion priors play a crucial role in humanoid robotics. Although such priors have been widely studied in human motion recovery (HMR) domain with a range of models, their adoption for humanoid robots remains limited, largely due to the scarcity of high-quality humanoid motion data. In this work, we introduce Pose Distance Fields for Humanoid Robots (PDF-HR), a lightweight prior that represents the robot pose distribution as a continuous and differentiable manifold. Given an arbitrary pose, PDF-HR predicts its distance to a large corpus of retargeted robot poses, yielding a smooth measure of pose plausibility that is well suited for optimization and control. PDF-HR can be integrated as a reward shaping term, a regularizer, or a standalone plausibility scorer across diverse pipelines. We evaluate PDF-HR on various humanoid tasks, including single-trajectory motion tracking, general motion tracking, style-based motion mimicry, and general motion retargeting. Experiments show that this plug-and-play prior consistently and substantially strengthens strong baselines. Code and models will be released.
7. 【2602.04838】LitS: A novel Neighborhood Descriptor for Point Clouds
链接:https://arxiv.org/abs/2602.04838
作者:Jonatan B. Bastos,Francisco F. Rivera,Oscar G. Lorenzo,David L. Vilariño,José C. Cabaleiro,Alberto M. Esmorís,Tomás F. Pena
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scanning technologies, fundamental for representing, technological fields, applications that span, scientific and technological
备注:
点击查看摘要
Abstract:With the advancement of 3D scanning technologies, point clouds have become fundamental for representing 3D spatial data, with applications that span across various scientific and technological fields. Practical analysis of this data depends crucially on available neighborhood descriptors to accurately characterize the local geometries of the point cloud. This paper introduces LitS, a novel neighborhood descriptor for 2D and 3D point clouds. LitS are piecewise constant functions on the unit circle that allow points to keep track of their surroundings. Each element in LitS' domain represents a direction with respect to a local reference system. Once constructed, evaluating LitS at any given direction gives us information about the number of neighbors in a cone-like region centered around that same direction. Thus, LitS conveys a lot of information about the local neighborhood of a point, which can be leveraged to gain global structural understanding by analyzing how LitS changes between close points. In addition, LitS comes in two versions ('regular' and 'cumulative') and has two parameters, allowing them to adapt to various contexts and types of point clouds. Overall, they are a versatile neighborhood descriptor, capable of capturing the nuances of local point arrangements and resilient to common point cloud data issues such as variable density and noise.
8. 【2602.04832】It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
链接:https://arxiv.org/abs/2602.04832
作者:Hannah Pinson
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
关键词:neural networks, empirical success, understanding of neural, theoretical understanding, gradient descent
备注:
点击查看摘要
Abstract:Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles -- mutual alignment, unlocking and racing -- that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
9. 【2602.04820】oward Reliable and Explainable Nail Disease Classification: Leveraging Adversarial Training and Grad-CAM Visualization
链接:https://arxiv.org/abs/2602.04820
作者:Farzia Hossain,Samanta Ghosh,Shahida Begum,B. M. Shahria Alam,Mohammad Tahmid Noor,Md Parvez Mia,Nishat Tasnim Niloy
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Human nail diseases, Human nail, age groups, older individuals, gradually observed
备注: 6 pages, 12 figures. This is the author's accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore
点击查看摘要
Abstract:Human nail diseases are gradually observed over all age groups, especially among older individuals, often going ignored until they become severe. Early detection and accurate diagnosis of such conditions are important because they sometimes reveal our body's health problems. But it is challenging due to the inferred visual differences between disease types. This paper presents a machine learning-based model for automated classification of nail diseases based on a publicly available dataset, which contains 3,835 images scaling six categories. In 224x224 pixels, all images were resized to ensure consistency. To evaluate performance, four well-known CNN models-InceptionV3, DenseNet201, EfficientNetV2, and ResNet50 were trained and analyzed. Among these, InceptionV3 outperformed the others with an accuracy of 95.57%, while DenseNet201 came next with 94.79%. To make the model stronger and less likely to make mistakes on tricky or noisy images, we used adversarial training. To help understand how the model makes decisions, we used SHAP to highlight important features in the predictions. This system could be a helpful support for doctors, making nail disease diagnosis more accurate and faster.
10. 【2602.04819】XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas
链接:https://arxiv.org/abs/2602.04819
作者:Aqsa Sultana,Rayan Afsar,Ahmed Rahu,Surendra P. Singh,Brian Shula,Brandon Combs,Derrick Forchetti,Vijayan K. Asari
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Accurate risk stratification, developing colorectal cancer, routine colonoscopy screenings, Accurate risk, risk stratification
备注: 13 pages, 8 figures
点击查看摘要
Abstract:Accurate risk stratification of precancerous polyps during routine colonoscopy screenings is essential for lowering the risk of developing colorectal cancer (CRC). However, assessment of low-grade dysplasia remains limited by subjective histopathologic interpretation. Advancements in digital pathology and deep learning provide new opportunities to identify subtle and fine morphologic patterns associated with malignant progression that may be imperceptible to the human eye. In this work, we propose XtraLight-MedMamba, an ultra-lightweight state-space-based deep learning framework for classifying neoplastic tubular adenomas from whole-slide images (WSIs). The architecture is a blend of ConvNext based shallow feature extractor with parallel vision mamba to efficiently model both long- and short-range dependencies and image generalization. An integration of Spatial and Channel Attention Bridge (SCAB) module enhances multiscale feature extraction, while Fixed Non-Negative Orthogonal Classifier (FNOClassifier) enables substantial parameter reduction and improved generalization. The model was evaluated on a curated dataset acquired from patients with low-grade tubular adenomas, stratified into case and control cohorts based on subsequent CRC development. XtraLight-MedMamba achieved an accuracy of 97.18% and an F1-score of 0.9767 using approximately 32,000 parameters, outperforming transformer-based and conventional Mamba architectures with significantly higher model complexity.
11. 【2602.04814】X2HDR: HDR Image Generation in a Perceptually Uniform Space
链接:https://arxiv.org/abs/2602.04814
作者:Ronghuan Wu,Wanchao Su,Kede Ma,Jing Liao,Rafał K. Mantiuk
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:typically remain limited, large-scale HDR training, Stable Diffusion, HDR training data, Diffusion and FLUX
备注: Project page: [this https URL](https://x2hdr.github.io/) , Code: [this https URL](https://github.com/X2HDR/X2HDR)
点击查看摘要
Abstract:High-dynamic-range (HDR) formats and displays are becoming increasingly prevalent, yet state-of-the-art image generators (e.g., Stable Diffusion and FLUX) typically remain limited to low-dynamic-range (LDR) output due to the lack of large-scale HDR training data. In this work, we show that existing pretrained diffusion models can be easily adapted to HDR generation without retraining from scratch. A key challenge is that HDR images are natively represented in linear RGB, whose intensity and color statistics differ substantially from those of sRGB-encoded LDR images. This gap, however, can be effectively bridged by converting HDR inputs into perceptually uniform encodings (e.g., using PU21 or PQ). Empirically, we find that LDR-pretrained variational autoencoders (VAEs) reconstruct PU21-encoded HDR inputs with fidelity comparable to LDR data, whereas linear RGB inputs cause severe degradations. Motivated by this finding, we describe an efficient adaptation strategy that freezes the VAE and finetunes only the denoiser via low-rank adaptation in a perceptually uniform space. This results in a unified computational method that supports both text-to-HDR synthesis and single-image RAW-to-HDR reconstruction. Experiments demonstrate that our perceptually encoded adaptation consistently improves perceptual fidelity, text-image alignment, and effective dynamic range, relative to previous techniques.
12. 【2602.04802】VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
链接:https://arxiv.org/abs/2602.04802
作者:Qing'an Liu,Juntong Feng,Yuhao Wang,Xinzhe Han,Yujie Cheng,Yue Zhu,Haiwen Diao,Yunzhi Zhuge,Huchuan Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved impressive performance, existing benchmarks predominantly, benchmarks predominantly focus, achieved impressive, impressive performance
备注: 27 pages, 19 figures
点击查看摘要
Abstract:Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark from multimodal perception, reasoning, to unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at this https URL.
13. 【2602.04789】Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention
链接:https://arxiv.org/abs/2602.04789
作者:Chengtao Lv,Yumeng Shi,Yushi Huang,Ruihao Gong,Shen Ren,Wenya Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:improved visual fidelity, Advanced autoregressive, video generation models, sparse attention, fidelity and interactivity
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a \textit{Hierarchical Sparse Attention} to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (\ie, frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (\eg, 84.5 on VBench) and efficiency (\eg, $1.2{\sim}1.3\times$ end-to-end speedup). Combined with FP8 quantization and LightVAE, \textsc{Light Forcing} further achieves a $2.3\times$ speedup and 19.7\,FPS on an RTX~5090 GPU. Code will be released at \href{this https URL}{this https URL}.
14. 【2602.04770】Generative Modeling via Drifting
链接:https://arxiv.org/abs/2602.04770
作者:Mingyang Deng,He Li,Tianhong Li,Yilun Du,Kaiming He
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Generative modeling, pushforward distribution matches, formulated as learning, learning a mapping, matches the data
备注: Project page: [this https URL](https://lambertae.github.io/projects/drifting/)
点击查看摘要
Abstract:Generative modeling can be formulated as learning a mapping f such that its pushforward distribution matches the data distribution. The pushforward behavior can be carried out iteratively at inference time, for example in diffusion and flow-based models. In this paper, we propose a new paradigm called Drifting Models, which evolve the pushforward distribution during training and naturally admit one-step inference. We introduce a drifting field that governs the sample movement and achieves equilibrium when the distributions match. This leads to a training objective that allows the neural network optimizer to evolve the distribution. In experiments, our one-step generator achieves state-of-the-art results on ImageNet at 256 x 256 resolution, with an FID of 1.54 in latent space and 1.61 in pixel space. We hope that our work opens up new opportunities for high-quality one-step generation.
15. 【2602.04749】Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation
链接:https://arxiv.org/abs/2602.04749
作者:Buddhi Wijenayake,Nichula Wasalathilake,Roshan Godaliyadda,Vijitha Herath,Parakrama Ekanayake,Vishal M. Patel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long-tailed pixel imbalance, typically exhibits severe, exhibits severe long-tailed, severe long-tailed pixel, high-resolution remote-sensing imagery
备注:
点击查看摘要
Abstract:Semantic segmentation of high-resolution remote-sensing imagery is critical for urban mapping and land-cover monitoring, yet training data typically exhibits severe long-tailed pixel imbalance. In the dataset LoveDA, this challenge is compounded by an explicit Urban/Rural split with distinct appearance and inconsistent class-frequency statistics across domains. We present a prompt-controlled diffusion augmentation framework that synthesizes paired label--image samples with explicit control of both domain and semantic composition. Stage~A uses a domain-aware, masked ratio-conditioned discrete diffusion model to generate layouts that satisfy user-specified class-ratio targets while respecting learned co-occurrence structure. Stage~B translates layouts into photorealistic, domain-consistent images using Stable Diffusion with ControlNet guidance. Mixing the resulting ratio and domain-controlled synthetic pairs with real data yields consistent improvements across multiple segmentation backbones, with gains concentrated on minority classes and improved Urban and Rural generalization, demonstrating controllable augmentation as a practical mechanism to mitigate long-tail bias in remote-sensing segmentation. Source codes, pretrained models, and synthetic datasets are available at \href{this https URL}{Github}
16. 【2602.04722】How to rewrite the stars: Mapping your orchard over time through constellations of fruits
链接:https://arxiv.org/abs/2602.04722
作者:Gonçalo P. Matos,Carlos Santiago,João P. Costeira,Ricardo L. Saldanha,Ernesto M. Morgado
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:predict fruit setting, manually measure fruit, measure fruit sizes, early stages, caliper or dendrometers
备注: submitted to IEEE International Conference on Robotics Automation
点击查看摘要
Abstract:Following crop growth through the vegetative cycle allows farmers to predict fruit setting and yield in early stages, but it is a laborious and non-scalable task if performed by a human who has to manually measure fruit sizes with a caliper or dendrometers. In recent years, computer vision has been used to automate several tasks in precision agriculture, such as detecting and counting fruits, and estimating their size. However, the fundamental problem of matching the exact same fruits from one video, collected on a given date, to the fruits visible in another video, collected on a later date, which is needed to track fruits' growth through time, remains to be solved. Few attempts were made, but they either assume that the camera always starts from the same known position and that there are sufficiently distinct features to match, or they used other sources of data like GPS. Here we propose a new paradigm to tackle this problem, based on constellations of 3D centroids, and introduce a descriptor for very sparse 3D point clouds that can be used to match fruits across videos. Matching constellations instead of individual fruits is key to deal with non-rigidity, occlusions and challenging imagery with few distinct visual features to track. The results show that the proposed method can be successfully used to match fruits across videos and through time, and also to build an orchard map and later use it to locate the camera pose in 6DoF, thus providing a method for autonomous navigation of robots in the orchard and for selective fruit picking, for example.
17. 【2602.04713】Adaptive Prompt Elicitation for Text-to-Image Generation
链接:https://arxiv.org/abs/2602.04713
作者:Xinyi Wen,Lena Hegemann,Xiaofu Jin,Shuai Ma,Antti Oulasvirta
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:provide ambiguous inputs, Adaptive Prompt Elicitation, provide ambiguous, ambiguous inputs, inputs and struggle
备注: ACM International Conference on Intelligent User Interfaces (IUI) 2026, March 23-26, Paphos, Cyprus
点击查看摘要
Abstract:Aligning text-to-image generation with user intent remains challenging, for users who provide ambiguous inputs and struggle with model idiosyncrasies. We propose Adaptive Prompt Elicitation (APE), a technique that adaptively asks visual queries to help users refine prompts without extensive writing. Our technical contribution is a formulation of interactive intent inference under an information-theoretic framework. APE represents latent intent as interpretable feature requirements using language model priors, adaptively generates visual queries, and compiles elicited requirements into effective prompts. Evaluation on IDEA-Bench and DesignBench shows that APE achieves stronger alignment with improved efficiency. A user study with challenging user-defined tasks demonstrates 19.8% higher alignment without workload overhead. Our work contributes a principled approach to prompting that, for general users, offers an effective and efficient complement to the prevailing prompt-based interaction paradigm with text-to-image models.
18. 【2602.04712】SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation
链接:https://arxiv.org/abs/2602.04712
作者:David F. Ramirez,Tim Overman,Kristen Jaskie,Joe Marvin,Andreas Spanias
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
关键词:synthetic aperture radar, assisted AI agent, aperture radar, SAR Retrieval-Augmented Generation, present a visual-context
备注: Submitted to 2026 IEEE Radar Conference
点击查看摘要
Abstract:We present a visual-context image retrieval-augmented generation (ImageRAG) assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR). SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples with known true target types, our SAR-RAG system can compare similar vehicle categories, achieving improved ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.
19. 【2602.04699】Annotation Free Spacecraft Detection and Segmentation using Vision Language Models
链接:https://arxiv.org/abs/2602.04699
作者:Samet Hicsonmez,Jose Sosa,Dan Pineau,Inder Pal Singh,Arunkumar Rathinam,Abd El Rahman Shabayek,Djamila Aouada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language Models, Vision Language, zero-shot visual recognition, demonstrated remarkable performance, open-world zero-shot visual
备注: ICRA 2026
点击查看摘要
Abstract:Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at this https URL.
20. 【2602.04692】DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking
链接:https://arxiv.org/abs/2602.04692
作者:Sijia Chen,Lijuan Ma,Yanqiu Yu,En Yu,Liman Liu,Wenbing Tao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Referring Multi-Object Tracking, RGBD Referring Multi-Object, track specific targets, specific targets based, Referring Multi-Object
备注:
点击查看摘要
Abstract:Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., ``the person closest to the camera'') and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, a MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.
21. 【2602.04687】Investigating Disability Representations in Text-to-Image Models
链接:https://arxiv.org/abs/2602.04687
作者:Yang Yian,Yu Fan,Liudmila Zavolokina,Sarah Ebling
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:represent social groups, made remarkable progress, producing high-quality visual, high-quality visual content, textual descriptions
备注: 21 pages, 9 figures. References included
点击查看摘要
Abstract:Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.
22. 【2602.04677】REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency
链接:https://arxiv.org/abs/2602.04677
作者:Ondrej Tybl,Lukas Neumann
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:transfers knowledge, Knowledge Distillation, large teacher model, predictive distributions, Robust Estimator Distillation
备注:
点击查看摘要
Abstract:Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student by aligning their predictive distributions. However, conventional KD formulations - typically based on Kullback-Leibler divergence - assume that the teacher provides reliable soft targets. In practice, teacher predictions are often noisy or overconfident, and existing correction-based approaches rely on ad-hoc heuristics and extensive hyper-parameter tuning, which hinders generalization. We introduce REDistill (Robust Estimator Distillation), a simple yet principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence that adaptively downweights unreliable teacher output while preserving informative logit relationships. This formulation provides a unified and interpretable treatment of teacher noise, requires only logits, integrates seamlessly into existing KD pipelines, and incurs negligible computational overhead. Extensive experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy in diverse teacher-student architectures. Remarkably, it achieves these gains without model-specific hyper-parameter tuning, underscoring its robustness and strong generalization to unseen teacher-student pairs.
23. 【2602.04672】AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
链接:https://arxiv.org/abs/2602.04672
作者:Jin-Chuan Shi,Binhong Ye,Tao Liu,Junzhe He,Yangjinhui Xu,Xiaoyang Liu,Zeju Li,Hao Chen,Chunhua Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
关键词:Reconstructing dynamic hand-object, dexterous manipulation data, manipulation data collection, creating realistic digital, realistic digital twins
备注: 11 pages
点击查看摘要
Abstract:Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications.
24. 【2602.04657】PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
链接:https://arxiv.org/abs/2602.04657
作者:Haokui Zhang,Congyang Ou,Dawei Yan,Peng Wang,Qingsen Yan,Ying Li,Rong Xiao,Chunhua Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:accelerate VLM inference, reducing redundant visual, accelerate VLM, reducing redundant, vision-language models
备注:
点击查看摘要
Abstract:Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specially, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at this https URL.
25. 【2602.04624】A labeled dataset of simulated phlebotomy procedures for medical AI: polygon annotations for object detection and human-object interaction
链接:https://arxiv.org/abs/2602.04624
作者:Raúl Jiménez Cruz,César Torres-Huitzil,Marco Franceschetti,Ronny Seiger,Luciano García-Bañuelos,Barbara Weber
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:simulated blood extraction, Similarity Index Measure, labeled images documenting, Structural Similarity Index, data article presents
备注:
点击查看摘要
Abstract:This data article presents a dataset of 11,884 labeled images documenting a simulated blood extraction (phlebotomy) procedure performed on a training arm. Images were extracted from high-definition videos recorded under controlled conditions and curated to reduce redundancy using Structural Similarity Index Measure (SSIM) filtering. An automated face-anonymization step was applied to all videos prior to frame selection. Each image contains polygon annotations for five medically relevant classes: syringe, rubber band, disinfectant wipe, gloves, and training arm. The annotations were exported in a segmentation format compatible with modern object detection frameworks (e.g., YOLOv8), ensuring broad usability. This dataset is partitioned into training (70%), validation (15%), and test (15%) subsets and is designed to advance research in medical training automation and human-object interaction. It enables multiple applications, including phlebotomy tool detection, procedural step recognition, workflow analysis, conformance checking, and the development of educational systems that provide structured feedback to medical trainees. The data and accompanying label files are publicly available on Zenodo.
26. 【2602.04585】ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry
链接:https://arxiv.org/abs/2602.04585
作者:Marcin Możejko,Dawid Uchal,Krzysztof Gogolewski,Piotr Kupidura,Szymon Łukasik,Jakub Giezgała,Tomasz Nocoń,Kacper Pietrzyk,Robert Pieniuta,Mateusz Sulimowicz,Michal Orzyłowski,Tomasz Siłkowski,Karol Zagródka,Eike Staub,Ewa Szczurek
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:spatial tissue profiling, enables large-scale spatial, large-scale spatial tissue, imaging mass cytometry, high-throughput multiplex imaging
备注: 17 pages, 6 figures
点击查看摘要
Abstract:We present ImmuVis, an efficient convolutional foundation model for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest to-date dataset, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms SOTA baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical, efficient foundation model for real-world IMC modeling.
27. 【2602.04584】SalFormer360: a transformer-based saliency estimation model for 360-degree videos
链接:https://arxiv.org/abs/2602.04584
作者:Mahmoud Z. A. Wahba,Francesco Barbato,Sara Baldoni,Federica Battisti
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recent years due, received growing attention, range of applications, received growing, recent years
备注:
点击查看摘要
Abstract:Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
28. 【2602.04583】PEPR: Privileged Event-based Predictive Regularization for Domain Generalization
链接:https://arxiv.org/abs/2602.04583
作者:Gabriele Magrini,Federico Becattini,Niccolò Biondi,Pietro Pala
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep neural networks, Deep neural, neural networks, networks for visual, visual perception
备注:
点击查看摘要
Abstract:Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real-world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain-dependent, whereas the event stream is sparse yet more domain-invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event-based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross-modal alignment, we train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day-to-night and other domain shifts, outperforming alignment-based baselines across object detection and semantic segmentation.
29. 【2602.04565】Understanding Degradation with Vision Language Model
链接:https://arxiv.org/abs/2602.04565
作者:Guanzhou Lan,Chenyi Liao,Yuqi Yang,Qianli Ma,Zhigang Wang,Dong Wang,Bin Zhao,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision, Understanding visual degradations, critical yet challenging, challenging problem, problem in computer
备注: 17 pages
点击查看摘要
Abstract:Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce \textbf{DU-110k}, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
30. 【2602.04549】Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models
链接:https://arxiv.org/abs/2602.04549
作者:Cem Eteke,Enzo Tartaglione
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, revolutionized novel view, view rendering, Splatting, sparse Gaussians
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses sparse Gaussians. This enables real-time performance but increases space requirements, hindering applications such as immersive communication. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates, compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and towards 1000x rate improvement over 3DGS at comparable perceptual performance. The code will be open-sourced upon acceptance.
31. 【2602.04547】OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis
链接:https://arxiv.org/abs/2602.04547
作者:Luca Zedda,Andrea Loddo,Cecilia Di Ruberto
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:analysis increasingly benefits, Radiological analysis increasingly, support heterogeneous downstream, heterogeneous downstream tasks, analysis increasingly
备注: 19 pages, 4 figures, 12 tables
点击查看摘要
Abstract:Radiological analysis increasingly benefits from pretrained visual representations that can support heterogeneous downstream tasks across imaging modalities. In this work, we introduce OmniRad, a self-supervised radiological foundation model pretrained on 1.2 million medical images, designed with radiology-inspired principles emphasizing representation reuse and cross-task transferability. We evaluate the pretrained encoder under multiple downstream adaptation regimes, including lightweight task-specific adapters with a frozen backbone as well as full end-to-end fine-tuning for classification, allowing us to assess both representation quality and task-specific performance. OmniRad is evaluated on a broad suite of public benchmarks spanning classification and segmentation across multiple modalities. On the MedMNISTv2 collection, OmniRad improves classification F1 by up to 2.05% over competing foundation models. For dense prediction, OmniRad attains mean Dice score improvements across six MedSegBench datasets when using frozen representations. Qualitative analyses and latent-space visualizations suggest improved feature clustering and modality-related separation.
32. 【2602.04525】SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking
链接:https://arxiv.org/abs/2602.04525
作者:Muhammad Taha Mukhtar(1 and 2),Syed Musa Ali Kazmi(1),Khola Naseem(2),Muhammad Ali Chattha(2),Andreas Dengel(2),Sheraz Ahmed(2),Muhammad Naseer Bajwa(1),Muhammad Imran Malik(1) ((1) National University of Sciences and Technology (NUST), Islamabad, Pakistan, (2) German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Rapid urban expansion, Rapid urban, Pakistan and Mumbai, India serving, middle-income countries
备注: 10 pages, 8 figures, 5 tables
点击查看摘要
Abstract:Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling 1,869 $\text{km}^2$ of area. To evaluate the global robustness of our framework, we extend our experiments to five additional established benchmarks, encompassing eight cities across three continents, and provide comprehensive data quality assessments of all datasets. We also propose a new semi-supervised segmentation framework designed to mitigate the class imbalance and feature degradation inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression and a Prototype Bank System that enforces semantic consistency by anchoring predictions to historically learned high-fidelity feature representations. Extensive experiments across a total of eight cities spanning three continents demonstrate that our approach outperforms state-of-the-art semi-supervised baselines. Most notably, our method demonstrates superior domain transfer capability whereby a model trained on only 10% of source labels reaches a 0.461 mIoU on unseen geographies and outperforms the zero-shot generalization of fully supervised models.
33. 【2602.04517】S-MUSt3R: Sliding Multi-view 3D Reconstruction
链接:https://arxiv.org/abs/2602.04517
作者:Leonid Antsfeld,Boris Chidlovskii,Yohann Cabon,Vincent Leroy,Jerome Revaud
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:recent paradigm shift, vision led, perception from uncalibrated, uncalibrated images, recent paradigm
备注: 8 pages, 5 figures, 5 tables
点击查看摘要
Abstract:The recent paradigm shift in 3D vision led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop closure optimization. Without model retraining, we benefit from remarkable 3D reconstruction capacities of MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architecture. We evaluate S-MUSt3R on TUM, 7-Scenes and proprietary robot navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstruction. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene in real-world settings, with an important advantage of making predictions directly in the metric space.
34. 【2602.04515】EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
链接:https://arxiv.org/abs/2602.04515
作者:Yu Bai,MingMing Yu,Chaojie Li,Ziyi Bai,Xinlong Wang,Börje F. Karlsson
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Deploying humanoid robots, demands tight integration, dynamically changing environments, Deploying humanoid, fundamentally challenging
备注:
点击查看摘要
Abstract:Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments. As well as transitioning robustly between sub-tasks of different types. Towards addressing these challenges, we propose a novel task - EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real-time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.
35. 【2602.04476】Vision-aligned Latent Reasoning for Multi-modal Large Language Model
链接:https://arxiv.org/abs/2602.04476
作者:Byungwoo Jeon,Yoonwoo Jeong,Hyunseok Lee,Minsu Cho,Jinwoo Shin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, require extensive multi-step
备注: 18 pages; 5 figures
点击查看摘要
Abstract:Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.
36. 【2602.04473】SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-Sharpening
链接:https://arxiv.org/abs/2602.04473
作者:Junjie Li,Congyang Ou,Haokui Zhang,Guoting Wei,Shengqin Jiang,Ying Li,Chunhua Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:insights for Pan-sharpening, Pan-sharpening and notably, diffusion models bring, bring novel insights, notably boost fusion
备注:
点击查看摘要
Abstract:Recently, diffusion models bring novel insights for Pan-sharpening and notably boost fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) imagery, suffering from high latency and sensor-specific limitations. In this paper, we present SALAD-Pan, a sensor-agnostic latent space diffusion method for efficient pansharpening. Specifically, SALAD-Pan trains a band-wise single-channel VAE to encode high-resolution multispectral (HRMS) into compact latent representations, supporting MS images with various channel counts and establishing a basis for acceleration. Then spectral physical properties, along with PAN and MS images, are injected into the diffusion backbone through unidirectional and bidirectional interactive control structures respectively, achieving high-precision fusion in the diffusion process. Finally, a lightweight cross-spectral attention module is added to the central layer of diffusion model, reinforcing spectral connections to boost spectral consistency and further elevate fusion precision. Experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that SALAD-Pan outperforms state-of-the-art diffusion-based methods across all three datasets, attains a 2-3x inference speedup, and exhibits robust zero-shot (cross-sensor) capability.
37. 【2602.04462】mporal Slowness in Central Vision Drives Semantic Object Learning
链接:https://arxiv.org/abs/2602.04462
作者:Timothy Schaumlöffel,Arthur Aubret,Gemma Roig,Jochen Triesch
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:egocentric visual streams, minimal supervision, object representations, semantic object representations, streams with minimal
备注: ICLR 2026
点击查看摘要
Abstract:Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
38. 【2602.04454】Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search
链接:https://arxiv.org/abs/2602.04454
作者:Tianming Liang,Qirui Du,Jian-Fang Hu,Haichao Jiang,Zicheng Lin,Wei-Shi Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision, popular topic, topic in computer, Segmentation, MLLMs
备注:
点击查看摘要
Abstract:Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose \textbf{Seg-ReSearch}, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released at this https URL.
39. 【2602.04441】SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking
链接:https://arxiv.org/abs/2602.04441
作者:Weiguang Zhao,Haoran Xu,Xingyu Miao,Qin Zhao,Rui Zhang,Kaizhu Huang,Ning Gao,Peizhou Cao,Mingze Sun,Mulin Yu,Tao Lu,Linning Xu,Junting Dong,Jiangmiao Pang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern foundation models, follow visual points, Point tracking, Point tracking aims, general point tracking
备注:
点击查看摘要
Abstract:Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations. To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects. SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts. Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.
40. 【2602.04439】rajVG: 3D Trajectory-Coupled Visual Geometry Learning
链接:https://arxiv.org/abs/2602.04439
作者:Xingyu Miao,Weiguang Zhao,Tao Lu,Linning Yu,Mulin Yu,Yang Long,Jiangmiao Pang,Junting Dong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Feed-forward multi-frame, models often degrade, object motion, Feed-forward, Abstract
备注:
点击查看摘要
Abstract:Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. Global-reference becomes ambiguous under multiple motions, while the local pointmap relies heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses the current feedforward performance baseline.
41. 【2602.04416】Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare
链接:https://arxiv.org/abs/2602.04416
作者:Aavash Chhetri,Bibek Niroula,Pratik Shrestha,Yash Raj Shrestha,Lesley A Anderson,Prashnna K Gyawali,Loris Bazzani,Binod Bhattarai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enables collaborative model, collaborative model training, preserving data privacy, decentralized medical institutions, enables collaborative
备注:
点击查看摘要
Abstract:Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal modalities and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical MultiModal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future multimodal federated learning (MMFL) methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at this https URL .
42. 【2602.04411】Self-evolving Embodied AI
链接:https://arxiv.org/abs/2602.04411
作者:Tongtong Feng,Xin Wang,Wenwu Zhu
类目:Emerging Technologies (cs.ET); Computer Vision and Pattern Recognition (cs.CV)
关键词:intelligent system formed, active perception, action interaction, intelligent system, system formed
备注:
点击查看摘要
Abstract:Embodied Artificial Intelligence (AI) is an intelligent system formed by agents and their environment through active perception, embodied cognition, and action interaction. Existing embodied AI remains confined to human-crafted setting, in which agents are trained on given memory and construct models for given tasks, enabling fixed embodiments to interact with relatively static environments. Such methods fail in in-the-wild setting characterized by variable embodiments and dynamic open environments. This paper introduces self-evolving embodied AI, a new paradigm in which agents operate based on their changing state and environment with memory self-updating, task self-switching, environment self-prediction, embodiment self-adaptation, and model self-evolution, aiming to achieve continually adaptive intelligence with autonomous evolution. Specifically, we present the definition, framework, components, and mechanisms of self-evolving embodied AI, systematically review state-of-the-art works for realized components, discuss practical applications, and point out future research directions. We believe that self-evolving embodied AI enables agents to autonomously learn and interact with environments in a human-like manner and provide a new perspective toward general artificial intelligence.
43. 【2602.04406】LCUDiff: Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration
链接:https://arxiv.org/abs/2602.04406
作者:Jue Gong,Zihan Zhou,Jingkai Wang,Shu Li,Libo Liu,Jianliang Lan,Yulun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:restoring degraded human-centric, degraded human-centric images, human body restoration, Existing methods, restoring degraded
备注: 8 pages, 7 figures. The code and model will be at [this https URL](https://github.com/gobunu/LCUDiff)
点击查看摘要
Abstract:Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity, particularly in human body restoration (HBR). Recent diffusion-based restoration methods commonly adapt pre-trained text-to-image diffusion models, where the variational autoencoder (VAE) can significantly bottleneck restoration fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model from the 4-channel latent space to the 16-channel latent space. For VAE fine-tuning, channel splitting distillation (CSD) is used to keep the first four channels aligned with pre-trained priors while allocating the additional channels to effectively encode high-frequency details. We further design prior-preserving adaptation (PPA) to smoothly bridge the mismatch between 4-channel diffusion backbones and the higher-dimensional 16-channel latent. In addition, we propose a decoder router (DeR) for per-sample decoder routing using restoration-quality score annotations, which improves visual quality across diverse conditions. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts under mild degradations, while preserving one-step efficiency. The code and model will be at this https URL.
44. 【2602.04405】Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion
链接:https://arxiv.org/abs/2602.04405
作者:Yixin Zhu,Long Lv,Pingping Zhang,Xuehu Liu,Tongdan Tang,Feng Tian,Weibing Sun,Huchuan Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:retaining texture details, preserving significant information, produce fused images, Interactive Spatial-Frequency Fusion, Multi-Modal Image Fusion
备注: This work is accepted by IEEE Transactions on Image Processing. More modifications may be performed
点击查看摘要
Abstract:Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at this https URL.
45. 【2602.04401】Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition
链接:https://arxiv.org/abs/2602.04401
作者:Dhyey Manish Rajani,Michael Milford,Tobias Fischer
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Place Recognition, Visual Place, Place Recognition, performance critically depends, image matching threshold
备注:
点击查看摘要
Abstract:Visual Place Recognition (VPR) is a key component for localisation in GNSS-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that, given a user-defined precision requirement, automatically selects the operating point of a VPR system to maximise recall. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets, making the method robust to sampling variability. Experiments with multiple state-of-the-art VPR techniques and datasets show that the proposed approach consistently outperforms the state-of-the-art, delivering up to 25% higher recall in high-precision operating regimes. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code will be released upon acceptance.
46. 【2602.04381】Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture
链接:https://arxiv.org/abs/2602.04381
作者:Weihao Gao,Zhuo Deng,Zheng Gong,Lan Ma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:accurate polyp identification, colorectal cancer hinges, Early detection, hinges on real-time, accurate polyp
备注: 19pages, 5 figures
点击查看摘要
Abstract:Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains 94% of the Dice score of a 31 M-parameter U-Net while utilizing only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.
47. 【2602.04361】SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
链接:https://arxiv.org/abs/2602.04361
作者:Zekun Li,Ning Wang,Tongxin Bai,Changwang Mei,Peisong Wang,Shuang Qiu,Jian Cheng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:innovative next-scale prediction, next-scale prediction paradigm, garnered significant attention, modeling has garnered, garnered significant
备注:
点击查看摘要
Abstract:Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves $\mathbf{ 5\times}$ faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing $1024\times1024$ high-resolution images to the 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparseVAR attains up to a $\mathbf{2.28\times}$ acceleration, while maintaining competitive visual generation quality. Code is available at this https URL.
48. 【2602.04356】When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models
链接:https://arxiv.org/abs/2602.04356
作者:Jaehyun Kwak,Nam Cao,Boryeong Cho,Segyu Lee,Sumyeong Ahn,Se-Young Yun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, modern multimodal systems, exposing safety vulnerabilities, Vision-Language Models, Large Vision-Language
备注: Pre-print
点击查看摘要
Abstract:Adversarial attacks against Large Vision-Language Models (LVLMs) are crucial for exposing safety vulnerabilities in modern multimodal systems. Recent attacks based on input transformations, such as random cropping, suggest that spatially localized perturbations can be more effective than global image manipulation. However, randomly cropping the entire image is inherently stochastic and fails to use the limited per-pixel perturbation budget efficiently. We make two key observations: (i) regional attention scores are positively correlated with adversarial loss sensitivity, and (ii) attacking high-attention regions induces a structured redistribution of attention toward subsequent salient regions. Based on these findings, we propose Stage-wise Attention-Guided Attack (SAGA), an attention-guided framework that progressively concentrates perturbations on high-attention regions. SAGA enables more efficient use of constrained perturbation budgets, producing highly imperceptible adversarial examples while consistently achieving state-of-the-art attack success rates across ten LVLMs. The source code is available at this https URL.
49. 【2602.04349】VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image
链接:https://arxiv.org/abs/2602.04349
作者:Teng-Fang Hsiao,Bo-Kai Ruan,Yu-Lun Liu,Hong-Han Shuai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:critical research area, critical research, research area, area to provide, provide users
备注:
点击查看摘要
Abstract:3D editing has emerged as a critical research area to provide users with flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi-view images, the direct editing of 3D meshes remains underexplored. Prior attempts, such as VoxHammer, rely on voxel-based representations that suffer from limited resolution and necessitate labor-intensive 3D mask. To address these limitations, we propose \textbf{VecSet-Edit}, the first pipeline that leverages the high-fidelity VecSet Large Reconstruction Model (LRM) as a backbone for mesh editing. Our approach is grounded on a analysis of the spatial properties in VecSet tokens, revealing that token subsets govern distinct geometric regions. Based on this insight, we introduce Mask-guided Token Seeding and Attention-aligned Token Gating strategies to precisely localize target regions using only 2D image conditions. Also, considering the difference between VecSet diffusion process versus voxel we design a Drift-aware Token Pruning to reject geometric outliers during the denoising process. Finally, our Detail-preserving Texture Baking module ensures that we not only preserve the geometric details of original mesh but also the textural information. More details can be found in our project page: this https URL
50. 【2602.04343】Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception
链接:https://arxiv.org/abs/2602.04343
作者:Sebastian Jung,Leonard Klüpfel,Rudolph Triebel,Maximilian Durner
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:present Neural Memory, Neural Memory Object, Neural Memory, present Neural, Memory Object
备注: 17 pages including supplement, published in 3DV 2026, Project website: [this https URL](https://sebastian-jung.github.io/nemo/)
点击查看摘要
Abstract:We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. this https URL
51. 【2602.04340】Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning
链接:https://arxiv.org/abs/2602.04340
作者:Qian-Wei Wang,Yaguang Song,Shu-Tao Xia
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Pre-trained vision-language models, exhibit strong transferability, CLIP exhibit strong, Pre-trained vision-language, budgets remains challenging
备注:
点击查看摘要
Abstract:Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to light-weight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in an reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
52. 【2602.04337】Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner
链接:https://arxiv.org/abs/2602.04337
作者:Qian-Wei Wang,Guanghao Meng,Ren Cai,Yaguang Song,Shu-Tao Xia
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:CLIP exhibit strong, Large-scale vision-language models, strong zero-shot generalization, exhibit strong zero-shot, downstream tasks typically
备注:
点击查看摘要
Abstract:Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.
53. 【2602.04328】Multiview Self-Representation Learning across Heterogeneous Views
链接:https://arxiv.org/abs/2602.04328
作者:Jie Chen,Zhu Wang,Chuanbin Liu,Xi Peng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:exhibit inherently distinct, model pretraining objectives, inherently distinct feature, objectives or architectures, sample generated
备注: 12 pages
点击查看摘要
Abstract:Features of the same sample generated by different pretrained models often exhibit inherently distinct feature distributions because of discrepancies in the model pretraining objectives or architectures. Learning invariant representations from large-scale unlabeled visual data with various pretrained models in a fully unsupervised transfer manner remains a significant challenge. In this paper, we propose a multiview self-representation learning (MSRL) method in which invariant representations are learned by exploiting the self-representation property of features across heterogeneous views. The features are derived from large-scale unlabeled visual data through transfer learning with various pretrained models and are referred to as heterogeneous multiview data. An individual linear model is stacked on top of its corresponding frozen pretrained backbone. We introduce an information-passing mechanism that relies on self-representation learning to support feature aggregation over the outputs of the linear model. Moreover, an assignment probability distribution consistency scheme is presented to guide multiview self-representation learning by exploiting complementary information across different views. Consequently, representation invariance across different linear models is enforced through this scheme. In addition, we provide a theoretical analysis of the information-passing mechanism, the assignment probability distribution consistency and the incremental views. Extensive experiments with multiple benchmark visual datasets demonstrate that the proposed MSRL method consistently outperforms several state-of-the-art approaches.
54. 【2602.04317】JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction
链接:https://arxiv.org/abs/2602.04317
作者:Zihan Lou,Jinlong Fan,Sihan Ma,Yuxiang Yang,Jing Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Reconstructing high-fidelity animatable, monocular RGB videos, RGB videos remains, videos remains challenging, Reconstructing high-fidelity
备注: 15 pages, 15 figures, Project page at [this https URL](https://github.com/MiliLab/JOintGS)
点击查看摘要
Abstract:Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. Splatting (3DGS) advances demonstrate impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, limiting their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with 2.1~dB PSNR improvement over state-of-the-art methods on NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the this http URL source code is available at this https URL.
55. 【2602.04315】GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
链接:https://arxiv.org/abs/2602.04315
作者:Guoqing Ma,Siheng Wang,Zeyu Zhang,Shan Yu,Hao Tang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large foundation models, shown strong open-world, strong open-world generalization, Large foundation, vision and language
备注:
点击查看摘要
Abstract:Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that can be more effective in utilizing the generalization of foundation models, enabling zero-shot manipulation and automatically generating data for robotics. In particular, we study a class of hierarchical VLA model where the high-level ASM (Affordance Segmentation Module) is finetuned to perceive image keypoint affordances of the scene; the mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. The intermediate 3D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstration, making it much more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations can train more robust behavior cloning policies than training with human demonstrations or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Code: this https URL. Website: this https URL.
56. 【2602.04304】Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement
链接:https://arxiv.org/abs/2602.04304
作者:Zipeng Zhu,Zhanghao Hu,Qinglin Zhu,Yuxi Hong,Yijun Liu,Jingyong Su,Yulan He,Lin Gui
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, text embedding space, uniform pretraining resolution, fixed visual-token budget, visual-token budget forces
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
57. 【2602.04300】Light Up Your Face: A Physically Consistent Dataset and Diffusion Model for Face Fill-Light Enhancement
链接:https://arxiv.org/abs/2602.04300
作者:Jue Gong,Zihan Zhou,Jingkai Wang,Xiaohong Liu,Yulun Zhang,Xiaokang Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:brightens underexposed faces, adding virtual fill, Face fill-light enhancement, original scene illumination, brightens underexposed
备注: 8 pages, 7 figures. The code and model will be available at [this https URL](https://github.com/gobunu/Light-Up-Your-Face)
点击查看摘要
Abstract:Face fill-light enhancement (FFE) brightens underexposed faces by adding virtual fill light while keeping the original scene illumination and background unchanged. Most face relighting methods aim to reshape overall lighting, which can suppress the input illumination or modify the entire scene, leading to foreground-background inconsistency and mismatching practical FFE needs. To support scalable learning, we introduce LightYourFace-160K (LYF-160K), a large-scale paired dataset built with a physically consistent renderer that injects a disk-shaped area fill light controlled by six disentangled factors, producing 160K before-and-after pairs. We first pretrain a physics-aware lighting prompt (PALP) that embeds the 6D parameters into conditioning tokens, using an auxiliary planar-light reconstruction objective. Building on a pretrained diffusion backbone, we then train a fill-light diffusion (FiLitDiff), an efficient one-step model conditioned on physically grounded lighting codes, enabling controllable and high-fidelity fill lighting at low computational cost. Experiments on held-out paired sets demonstrate strong perceptual quality and competitive full-reference metrics, while better preserving background illumination. The dataset and model will be at this https URL.
58. 【2602.04271】SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization
链接:https://arxiv.org/abs/2602.04271
作者:Lifan Wu,Ruijie Zhu,Yubo Ai,Tianzhu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:made remarkable progress, made remarkable, remarkable progress, progress in synthesizing, synthesizing dynamic
备注: Accepted by CVM 2026. Project page: [this https URL](https://wusar.github.io/projects/skeletongaussian)
点击查看摘要
Abstract:4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: this https URL
59. 【2602.04268】KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
链接:https://arxiv.org/abs/2602.04268
作者:Siyu Jiang,Feiyang Chen,Xiaojin Zhang,Kun He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, visually inconsistent objects, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\mathit{CHAIR}_{S}$ from $41.8 \rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \rightarrow 79.2$), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.04268 [cs.CV]
(or
arXiv:2602.04268v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.04268
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
60. 【2602.04260】Decoupled Hierarchical Distillation for Multimodal Emotion Recognition
链接:https://arxiv.org/abs/2602.04260
作者:Yong Li,Yuanzhi Wang,Yi Ding,Shiqing Zhang,Ke Lu,Cuntai Guan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:infer human emotions, Human multimodal emotion, multimodal emotion recognition, human emotions, infer human
备注: arXiv admin note: text overlap with [arXiv:2303.13802](https://arxiv.org/abs/2303.13802)
点击查看摘要
Abstract:Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3\%/2.4\% (ACC$_7$), 1.3\%/1.9\% (ACC$_2$) and 1.9\%/1.8\% (F1) relative improvement on CMU-MOSI/CMU-MOSEI dataset, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.
61. 【2602.04257】Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery
链接:https://arxiv.org/abs/2602.04257
作者:Jiaxin Cen,Xudong Mao,Guanghui Yue,Wei Zhou,Ruomei Wang,Fan Zhou,Baoquan Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Monocular video human, video human mesh, human mesh recovery, mesh recovery faces, recovery faces fundamental
备注:
点击查看摘要
Abstract:Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: A Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; A Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; A Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in robustness against heavy occlusion and spatial accuracy while maintaining computational efficiency.
62. 【2602.04252】ACIL: Active Class Incremental Learning for Image Classification
链接:https://arxiv.org/abs/2602.04252
作者:Aditya R. Bhattacharya,Debanjan Goswami,Shayok Chakraborty
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:realistic learning scenario, computer vision systems, Continual learning, class incremental learning, previous episodes
备注: BMVC 2024 (Accepted). Authors, Aditya R. Bhattacharya and Debanjan Goswami contributed equally to this work
点击查看摘要
Abstract:Continual learning (or class incremental learning) is a realistic learning scenario for computer vision systems, where deep neural networks are trained on episodic data, and the data from previous episodes are generally inaccessible to the model. Existing research in this domain has primarily focused on avoiding catastrophic forgetting, which occurs due to the continuously changing class distributions in each episode and the inaccessibility of the data from previous episodes. However, these methods assume that all the training samples in every episode are annotated; this not only incurs a huge annotation cost, but also results in a wastage of annotation effort, since most of the samples in a given episode will not be accessible to the model in subsequent episodes. Active learning algorithms identify the salient and informative samples from large amounts of unlabeled data and are instrumental in reducing the human annotation effort in inducing a deep neural network. In this paper, we propose ACIL, a novel active learning framework for class incremental learning settings. We exploit a criterion based on uncertainty and diversity to identify the exemplar samples that need to be annotated in each episode, and will be appended to the data in the next episode. Such a framework can drastically reduce annotation cost and can also avoid catastrophic forgetting. Our extensive empirical analyses on several vision datasets corroborate the promise and potential of our framework against relevant baselines.
63. 【2602.04251】owards Next-Generation SLAM: A Survey on 3DGS-SLAM Focusing on Performance, Robustness, and Future Directions
链接:https://arxiv.org/abs/2602.04251
作者:Li Wang,Ruixuan Gong,Yumo Han,Lei Yang,Lu Yang,Ying Li,Bin Xu,Huaping Liu,Rong Fu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Traditional Simultaneous Localization, Traditional Simultaneous, Localization and Mapping, Simultaneous Localization, face limitations including
备注:
点击查看摘要
Abstract:Traditional Simultaneous Localization and Mapping (SLAM) systems often face limitations including coarse rendering quality, insufficient recovery of scene details, and poor robustness in dynamic environments. 3D Gaussian Splatting (3DGS), with its efficient explicit representation and high-quality rendering capabilities, offers a new reconstruction paradigm for SLAM. This survey comprehensively reviews key technical approaches for integrating 3DGS with SLAM. We analyze performance optimization of representative methods across four critical dimensions: rendering quality, tracking accuracy, reconstruction speed, and memory consumption, delving into their design principles and breakthroughs. Furthermore, we examine methods for enhancing the robustness of 3DGS-SLAM in complex environments such as motion blur and dynamic environments. Finally, we discuss future challenges and development trends in this area. This survey aims to provide a technical reference for researchers and foster the development of next-generation SLAM systems characterized by high fidelity, efficiency, and robustness.
64. 【2602.04240】SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction
链接:https://arxiv.org/abs/2602.04240
作者:Suzeyu Chen,Leheng Li,Ying-Cong Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:Achieving highly accurate, Achieving highly, accurate and real-time, occupancy prediction, autonomous vehicles
备注: 8 pages, 6 figures
点击查看摘要
Abstract:Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at this https URL.
Comments:
8 pages, 6 figures
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:
arXiv:2602.04240 [cs.CV]
(or
arXiv:2602.04240v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.04240
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
65. 【2602.04227】An Intuitionistic Fuzzy Logic Driven UNet architecture: Application to Brain Image segmentation
链接:https://arxiv.org/abs/2602.04227
作者:Hanuman Verma,Kiho Im,Pranabesh Maji,Akshansh Gupta
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:MRI brain images, medical image computing, MRI brain, Accurate segmentation, diagnosis of neuro-logical
备注:
点击查看摘要
Abstract:Accurate segmentation of MRI brain images is essential for image analysis, diagnosis of neuro-logical disorders and medical image computing. In the deep learning approach, the convolutional neural networks (CNNs), especially UNet, are widely applied in medical image segmentation. However, it is difficult to deal with uncertainty due to the partial volume effect in brain images. To overcome this limitation, we propose an enhanced framework, named UNet with intuitionistic fuzzy logic (IF-UNet), which incorporates intuitionistic fuzzy logic into UNet. The model processes input data in terms of membership, nonmembership, and hesitation degrees, allowing it to better address tissue ambiguity resulting from partial volume effects and boundary uncertainties. The proposed architecture is evaluated on the Internet Brain Segmentation Repository (IBSR) dataset, and its performance is computed using accuracy, Dice coefficient, and intersection over union (IoU). Experimental results confirm that IF-UNet improves segmentation quality with handling uncertainty in brain images.
66. 【2602.04220】Adaptive 1D Video Diffusion Autoencoder
链接:https://arxiv.org/abs/2602.04220
作者:Yao Teng,Minxuan Lin,Xian Liu,Shuai Wang,Xiao Yang,Xihui Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:models largely rely, Recent video generation, Recent video, generation models largely, video autoencoders
备注:
点击查看摘要
Abstract:Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
67. 【2602.04204】AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting
链接:https://arxiv.org/abs/2602.04204
作者:Chao Li,Rui Zhang,Siyuan Huang,Xian Zhong,Hongbo Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Human trajectory forecasting, forecasting requires capturing, pedestrian behavior, Human trajectory, requires capturing
备注: 14 pages, 3 figures
点击查看摘要
Abstract:Human trajectory forecasting requires capturing the multimodal nature of pedestrian behavior. However, existing approaches suffer from prior misalignment. Their learned or fixed priors often fail to capture the full distribution of plausible futures, limiting both prediction accuracy and diversity. We theoretically establish that prediction error is lower-bounded by prior quality, making prior modeling a key performance bottleneck. Guided by this insight, we propose AGMA (Adaptive Gaussian Mixture Anchors), which constructs expressive priors through two stages: extracting diverse behavioral patterns from training data and distilling them into a scene-adaptive global prior for inference. Extensive experiments on ETH-UCY, Stanford Drone, and JRDB datasets demonstrate that AGMA achieves state-of-the-art performance, confirming the critical role of high-quality priors in trajectory forecasting.
68. 【2602.04202】VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents
链接:https://arxiv.org/abs/2602.04202
作者:Feng Wang,Yichun Shi,Ceyuan Yang,Qiushan Guo,Jingxiang Sun,Alan Yuille,Peng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:work presents VTok, unified video tokenization, video tokenization framework, work presents, video
备注:
点击查看摘要
Abstract:This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
69. 【2602.04193】Continuous Degradation Modeling via Latent Flow Matching for Real-World Super-Resolution
链接:https://arxiv.org/abs/2602.04193
作者:Hyeonjae Kim,Dongjin Kim,Eugene Jin,Tae Hyun Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:deep learning-based super-resolution, shown impressive outcomes, learning-based super-resolution, methods have shown, bicubic downsampling
备注: AAAI 2026
点击查看摘要
Abstract:While deep learning-based super-resolution (SR) methods have shown impressive outcomes with synthetic degradation scenarios such as bicubic downsampling, they frequently struggle to perform well on real-world images that feature complex, nonlinear degradations like noise, blur, and compression artifacts. Recent efforts to address this issue have involved the painstaking compilation of real low-resolution (LR) and high-resolution (HR) image pairs, usually limited to several specific downscaling factors. To address these challenges, our work introduces a novel framework capable of synthesizing authentic LR images from a single HR image by leveraging the latent degradation space with flow matching. Our approach generates LR images with realistic artifacts at unseen degradation levels, which facilitates the creation of large-scale, real-world SR training datasets. Comprehensive quantitative and qualitative assessments verify that our synthetic LR images accurately replicate real-world degradations. Furthermore, both traditional and arbitrary-scale SR models trained using our datasets consistently yield much better HR outcomes.
70. 【2602.04188】DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding
链接:https://arxiv.org/abs/2602.04188
作者:Ning Zhang,Zhengyu Li,Kwong Weng Loh,Mingxi Xu,Qi Wang,Zhengyu Wen,Xiaoyu He,Wei Zhao,Kehong Gong,Mingyuan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Prior masked modeling, methods predominantly study, generation methods predominantly, Prior masked, predominantly study
备注:
点击查看摘要
Abstract:Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text--motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement this http URL further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural this http URL qualitative results are available on our project page: this https URL.
71. 【2602.04184】Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models
链接:https://arxiv.org/abs/2602.04184
作者:Angel Martinez-Sanchez,Parthib Roy,Ross Greer
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:passenger language guides, Instruction-grounded driving, language guides trajectory, requires vehicles, passenger language
备注:
点击查看摘要
Abstract:Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting, presenting a reproducible instruction-conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA's vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: this https URL
72. 【2602.04182】HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating
链接:https://arxiv.org/abs/2602.04182
作者:Weidong Hao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Event-based Action Recognition, high temporal resolution, high dynamic range, Event-based Action, Action Recognition
备注:
点击查看摘要
Abstract:Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. However, existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns. To address these challenges, we propose an efficient EAR framework named HoloEv-Net. First, to simultaneously tackle representation and structural redundancies, we introduce a Compact Holographic Spatiotemporal Representation (CHSR). Departing from computationally expensive voxel grids, CHSR implicitly embeds horizontal spatial cues into the Time-Height (T-H) view, effectively preserving 3D spatiotemporal contexts within a 2D representation. Second, to exploit the neglected spectral cues, we design a Global Spectral Gating (GSG) module. By leveraging the Fast Fourier Transform (FFT) for global token mixing in the frequency domain, GSG enhances the representation capability with negligible parameter overhead. Extensive experiments demonstrate the scalability and effectiveness of our framework. Specifically, HoloEv-Net-Base achieves state-of-the-art performance on THU-EACT-50-CHL, HARDVS and DailyDVS-200, outperforming existing methods by 10.29%, 1.71% and 6.25%, respectively. Furthermore, our lightweight variant, HoloEv-Net-Small, delivers highly competitive accuracy while offering extreme efficiency, reducing parameters by 5.4 times, FLOPs by 300times, and latency by 2.4times compared to heavy baselines, demonstrating its potential for edge deployment.
73. 【2602.04170】Partial Ring Scan: Revisiting Scan Order in Vision State Space Models
链接:https://arxiv.org/abs/2602.04170
作者:Yi-Kuan Hsieh,Jun-Wei Hsieh,Xin li,Ming-Ching Chang,Yu-Chee Tseng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:State Space Models, State Space, Space Models, offering lineartime sequence, lineartime sequence processing
备注: 10 pages, 3 figures
点击查看摘要
Abstract:State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering lineartime sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1~2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
74. 【2602.04167】Point2Insert: Video Object Insertion via Sparse Point Guidance
链接:https://arxiv.org/abs/2602.04167
作者:Yu Zhou,Xiaoyan Yang,Bojia Zi,Lihan Zhang,Ruijie Sun,Weishi Zheng,Haibin Huang,Chi Zhang,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low-effort object placement, paper introduces, framework for flexible, popularity of accurate, flexible and user-friendly
备注:
点击查看摘要
Abstract:This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with $\times$10 more parameters.
75. 【2602.04162】Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity
链接:https://arxiv.org/abs/2602.04162
作者:Chenhe Du,Qing Wu,Xuanyu Tian,Jingyi Yu,Hongjiang Wei,Yuyao Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
关键词:medical imaging, scientific research, imaging, high demand, demand and essential
备注: Accepted by ICLR 2026
点击查看摘要
Abstract:3D medical imaging is in high demand and essential for clinical diagnosis and scientific research. Currently, diffusion models (DMs) have become an effective tool for medical imaging reconstruction thanks to their ability to learn rich, high-quality data priors. However, learning the 3D data distribution with DMs in medical imaging is challenging, not only due to the difficulties in data collection but also because of the significant computational burden during model training. A common compromise is to train the DMs on 2D data priors and reconstruct stacked 2D slices to address 3D medical inverse problems. However, the intrinsic randomness of diffusion sampling causes severe inter-slice discontinuities of reconstructed 3D volumes. Existing methods often enforce continuity regularizations along the z-axis, which introduces sensitive hyper-parameters and may lead to over-smoothing results. In this work, we revisit the origin of stochasticity in diffusion sampling and introduce Inter-Slice Consistent Stochasticity (ISCS), a simple yet effective strategy that encourages interslice consistency during diffusion sampling. Our key idea is to control the consistency of stochastic noise components during diffusion sampling, thereby aligning their sampling trajectories without adding any new loss terms or optimization steps. Importantly, the proposed ISCS is plug-and-play and can be dropped into any 2D trained diffusion based 3D reconstruction pipeline without additional computational cost. Experiments on several medical imaging problems show that our method can effectively improve the performance of medical 3D imaging problems based on 2D diffusion models. Our findings suggest that controlling inter-slice stochasticity is a principled and practically attractive route toward high-fidelity 3D medical imaging with 2D diffusion priors. The code is available at: this https URL
76. 【2602.04154】Context Determines Optimal Architecture in Materials Segmentation
链接:https://arxiv.org/abs/2602.04154
作者:Mingjian Lu,Pawan K. Tripathi,Mark Shteyn,Debargha Ganguly,Roger H. French,Vipin Chaudhary,Yinghui Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:obscuring deployment-relevant performance, deployment-relevant performance variations, single imaging modalities, segmentation spanning SEM, obscuring deployment-relevant
备注:
点击查看摘要
Abstract:Segmentation architectures are typically benchmarked on single imaging modalities, obscuring deployment-relevant performance variations: an architecture optimal for one modality may underperform on another. We present a cross-modal evaluation framework for materials image segmentation spanning SEM, AFM, XCT, and optical microscopy. Our evaluation of six encoder-decoder combinations across seven datasets reveals that optimal architectures vary systematically by context: UNet excels for high-contrast 2D imaging while DeepLabv3+ is preferred for the hardest cases. The framework also provides deployment feedback via out-of-distribution detection and counterfactual explanations that reveal which microstructural features drive predictions. Together, the architecture guidance, reliability signals, and interpretability tools address a practical gap in materials characterization, where researchers lack tools to select architectures for their specific imaging setup or assess when models can be trusted on new samples.
77. 【2602.04142】JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models
链接:https://arxiv.org/abs/2602.04142
作者:Hiroshi Sasaki
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:analyse complex documents, complex documents, expected to analyse, analyse complex, Vision
备注: 7 pages
点击查看摘要
Abstract:Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset's synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at this https URL.
78. 【2602.04108】SuperPoint-E: local features for 3D reconstruction via tracking adaptation in endoscopy
链接:https://arxiv.org/abs/2602.04108
作者:O. Leon Barbed,José M. M. Montiel,Pascal Fua,Ana C. Murillo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:proposed Tracking Adaptation, Tracking Adaptation supervision, focus on boosting, improve the performance, feature extraction
备注: 12 pages, 5 tables, 6 figures
点击查看摘要
Abstract:In this work, we focus on boosting the feature extraction to improve the performance of Structure-from-Motion (SfM) in endoscopy videos. We present SuperPoint-E, a new local feature extraction method that, using our proposed Tracking Adaptation supervision strategy, significantly improves the quality of feature detection and description in endoscopy. Extensive experimentation on real endoscopy recordings studies our approach's most suitable configuration and evaluates SuperPoint-E feature quality. The comparison with other baselines also shows that our 3D reconstructions are denser and cover more and longer video segments because our detector fires more densely and our features are more likely to survive (i.e. higher detection precision). In addition, our descriptor is more discriminative, making the guided matching step almost redundant. The presented approach brings significant improvements in the 3D reconstructions obtained, via SfM on endoscopy videos, compared to the original SuperPoint and the gold standard SfM COLMAP pipeline.
79. 【2602.04102】DMS2F-HAD: A Dual-branch Mamba-based Spatial-Spectral Fusion Network for Hyperspectral Anomaly Detection
链接:https://arxiv.org/abs/2602.04102
作者:Aayushma Pant,Lakpa Tamang,Tsz-Kwan Lee,Sunil Aryal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:high-dimensional hyperspectral images, Hyperspectral anomaly detection, hyperspectral images, high-dimensional hyperspectral, aims to identify
备注: This paper has been accepted in the WACV 2025 conference in algorithm track
点击查看摘要
Abstract:Hyperspectral anomaly detection (HAD) aims to identify rare and irregular targets in high-dimensional hyperspectral images (HSIs), which are often noisy and unlabelled data. Existing deep learning methods either fail to capture long-range spectral dependencies (e.g., convolutional neural networks) or suffer from high computational cost (e.g., Transformers). To address these challenges, we propose DMS2F-HAD, a novel dual-branch Mamba-based model. Our architecture utilizes Mamba's linear-time modeling to efficiently learn distinct spatial and spectral features in specialized branches, which are then integrated by a dynamic gated fusion mechanism to enhance anomaly localization. Across fourteen benchmark HSI datasets, our proposed DMS2F-HAD not only achieves a state-of-the-art average AUC of 98.78%, but also demonstrates superior efficiency with an inference speed 4.6 times faster than comparable deep learning methods. The results highlight DMS2FHAD's strong generalization and scalability, positioning it as a strong candidate for practical HAD applications.
80. 【2602.04094】VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding
链接:https://arxiv.org/abs/2602.04094
作者:Junbo Zou,Ziheng Huang,Shengjie Zhang,Liwen Zhang,Weining Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Long-form video understanding, understanding remains challenging, capture information distributed, video understanding remains, Long-form video
备注:
点击查看摘要
Abstract:Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.
81. 【2602.04063】Sight: Towards expert-AI co-assessment for improved immunohistochemistry staining interpretation
链接:https://arxiv.org/abs/2602.04063
作者:Jacob S. Leiby,Jialu Yao,Pan Lu,George Hu,Anna Davidian,Shunsuke Koga,Olivia Leung,Pravin Patel,Isabella Tondi Resta,Rebecca Rojansky,Derek Sung,Eric Yang,Paul J. Zhang,Emma Lundberg,Dokyoon Kim,Serena Yeung-Levy,James Zou,Thomas Montine,Jeffrey Nirschl,Zhi Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:support pathology diagnosis, Human Protein Atlas, IHC, disease triage, support pathology
备注:
点击查看摘要
Abstract:Immunohistochemistry (IHC) provides information on protein expression in tissue sections and is commonly used to support pathology diagnosis and disease triage. While AI models for H\E-stained slides show promise, their applicability to IHC is limited due to domain-specific variations. Here we introduce HPA10M, a dataset that contains 10,495,672 IHC images from the Human Protein Atlas with comprehensive metadata included, and encompasses 45 normal tissue types and 20 major cancer types. Based on HPA10M, we trained iSight, a multi-task learning framework for automated IHC staining assessment. iSight combines visual features from whole-slide images with tissue metadata through a token-level attention mechanism, simultaneously predicting staining intensity, location, quantity, tissue type, and malignancy status. On held-out data, iSight achieved 85.5\% accuracy for location, 76.6\% for intensity, and 75.7\% for quantity, outperforming fine-tuned foundation models (PLIP, CONCH) by 2.5--10.2\%. In addition, iSight demonstrates well-calibrated predictions with expected calibration errors of 0.0150-0.0408. Furthermore, in a user study with eight pathologists evaluating 200 images from two datasets, iSight outperformed initial pathologist assessments on the held-out HPA dataset (79\% vs 68\% for location, 70\% vs 57\% for intensity, 68\% vs 52\% for quantity). Inter-pathologist agreement also improved after AI assistance in both held-out HPA (Cohen's $\kappa$ increased from 0.63 to 0.70) and Stanford TMAD datasets (from 0.74 to 0.76), suggesting expert--AI co-assessment can improve IHC interpretation. This work establishes a foundation for AI systems that can improve IHC diagnostic accuracy and highlights the potential for integrating iSight into clinical workflows to enhance the consistency and reliability of IHC assessment.
82. 【2602.04054】SEIS: Subspace-based Equivariance and Invariance Scores for Neural Representations
链接:https://arxiv.org/abs/2602.04054
作者:Huahua Lin,Katayoun Farrahi,Xiaohao Cai
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:meaningful spatial structure, preserve meaningful spatial, Understanding how neural, neural representations respond, learned features preserve
备注:
点击查看摘要
Abstract:Understanding how neural representations respond to geometric transformations is essential for evaluating whether learned features preserve meaningful spatial structure. Existing approaches primarily assess robustness by comparing model outputs under transformed inputs, offering limited insight into how geometric information is organized within internal representations and failing to distinguish between information loss and re-encoding. In this work, we introduce SEIS (Subspace-based Equivariance and Invariance Scores), a subspace metric for analyzing layer-wise feature representations under geometric transformations, disentangling equivariance from invariance without requiring labels or explicit knowledge of the transformation. Synthetic validation confirms that SEIS correctly recovers known transformations. Applied to trained classification networks, SEIS reveals a transition from equivariance in early layers to invariance in deeper layers, and that data augmentation increases invariance while preserving equivariance. We further show that multi-task learning induces synergistic gains in both properties at the shared encoder, and skip connections restore equivariance lost during decoding.
83. 【2602.04053】Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal
链接:https://arxiv.org/abs/2602.04053
作者:Rio Aguina-Kang,Kevin James Blackburn-Matzen,Thibault Groueix,Vladimir Kim,Matheus Gadelha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modeling objects individually, present SeeingThroughClutter, reconstructing structured, representations from single, single images
备注: To appear in 3DV 2026
点击查看摘要
Abstract:We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate stateof-the-art robustness on 3D-Front and ADE20K datasets. Project Page: this https URL
84. 【2602.04051】Artifact Removal and Image Restoration in AFM:A Structured Mask-Guided Directional Inpainting Approach
链接:https://arxiv.org/abs/2602.04051
作者:Juntao Zhang,Angona Biswas,Jaydeep Rade,Charchit Shukla,Juan Ren,Anwesha Sarkar,Adarsh Krishnamurthy,Aditya Balu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Atomic Force Microscopy, Atomic Force, Force Microscopy, enables high-resolution surface, high-resolution surface imaging
备注:
点击查看摘要
Abstract:Atomic Force Microscopy (AFM) enables high-resolution surface imaging at the nanoscale, yet the output is often degraded by artifacts introduced by environmental noise, scanning imperfections, and tip-sample interactions. To address this challenge, a lightweight and fully automated framework for artifact detection and restoration in AFM image analysis is presented. The pipeline begins with a classification model that determines whether an AFM image contains artifacts. If necessary, a lightweight semantic segmentation network, custom-designed and trained on AFM data, is applied to generate precise artifact masks. These masks are adaptively expanded based on their structural orientation and then inpainted using a directional neighbor-based interpolation strategy to preserve 3D surface continuity. A localized Gaussian smoothing operation is then applied for seamless restoration. The system is integrated into a user-friendly GUI that supports real-time parameter adjustments and batch processing. Experimental results demonstrate the effective artifact removal while preserving nanoscale structural details, providing a robust, geometry-aware solution for high-fidelity AFM data interpretation.
85. 【2602.04046】Fast, Unsupervised Framework for Registration Quality Assessment of Multi-stain Histological Whole Slide Pairs
链接:https://arxiv.org/abs/2602.04046
作者:Shikha Dubey,Patricia Raciti,Kristopher Standish,Albert Juan Ramon,Erik Ames Burlingame
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:integrated molecular analysis, High-fidelity registration, slide images, hematoxylin eosin, histopathological whole slide
备注: Accepted to IEEE ISBI 2026
点击查看摘要
Abstract:High-fidelity registration of histopathological whole slide images (WSIs), such as hematoxylin eosin (HE) and immunohistochemistry (IHC), is vital for integrated molecular analysis but challenging to evaluate without ground-truth (GT) annotations. Existing WSI-level assessments -- using annotated landmarks or intensity-based similarity metrics -- are often time-consuming, unreliable, and computationally intensive, limiting large-scale applicability. This study proposes a fast, unsupervised framework that jointly employs down-sampled tissue masks- and deformations-based metrics for registration quality assessment (RQA) of registered HE and IHC WSI pairs. The masks-based metrics measure global structural correspondence, while the deformations-based metrics evaluate local smoothness, continuity, and transformation realism. Validation across multiple IHC markers and multi-expert assessments demonstrate a strong correlation between automated metrics and human evaluations. In the absence of GT, this framework offers reliable, real-time RQA with high fidelity and minimal computational resources, making it suitable for large-scale quality control in digital pathology.
86. 【2602.04044】A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications
链接:https://arxiv.org/abs/2602.04044
作者:Panagiotis Mousouliotis,Georgios Keramidas
类目:Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
关键词:Field-Programmable Gate Arrays, Convolutional neural network, Gate Arrays, Convolutional neural, Field-Programmable Gate
备注: 6 pages, 4 figures. Published in the proceedings of the 2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2025), Kalamata, Greece, 6-9 July 2025
点击查看摘要
Abstract:Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.
87. 【2602.04043】AnyStyle: Single-Pass Multimodal Stylization for 3D Gaussian Splatting
链接:https://arxiv.org/abs/2602.04043
作者:Joanna Kaleta,Bartosz Świrta,Kacper Kania,Przemysław Spurek,Marek Kowalski
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, effective scene representation, rapid and scalable, asset creation, growing demand
备注:
点击查看摘要
Abstract:The growing demand for rapid and scalable 3D asset creation has driven interest in feed-forward 3D reconstruction methods, with 3D Gaussian Splatting (3DGS) emerging as an effective scene representation. While recent approaches have demonstrated pose-free reconstruction from unposed image collections, integrating stylization or appearance control into such pipelines remains underexplored. Existing attempts largely rely on image-based conditioning, which limits both controllability and flexibility. In this work, we introduce AnyStyle, a feed-forward 3D reconstruction and stylization framework that enables pose-free, zero-shot stylization through multimodal conditioning. Our method supports both textual and visual style inputs, allowing users to control the scene appearance using natural language descriptions or reference images. We propose a modular stylization architecture that requires only minimal architectural modifications and can be integrated into existing feed-forward 3D reconstruction backbones. Experiments demonstrate that AnyStyle improves style controllability over prior feed-forward stylization methods while preserving high-quality geometric reconstruction. A user study further confirms that AnyStyle achieves superior stylization quality compared to an existing state-of-the-art approach. Repository: this https URL.
88. 【2602.04030】CLS : Tightly Coupled Language Text Spotter
链接:https://arxiv.org/abs/2602.04030
作者:Leeje Jang,Yijun Lin,Yao-Yi Chiang,Jerod Weinman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:real-world images, aims to detect, detect and recognize, Scene text, Scene text spotting
备注:
点击查看摘要
Abstract:Scene text spotting aims to detect and recognize text in real-world images, where instances are often short, fragmented, or visually ambiguous. Existing methods primarily rely on visual cues and implicitly capture local character dependencies, but they overlook the benefits of external linguistic knowledge. Prior attempts to integrate language models either adapt language modeling objectives without external knowledge or apply pretrained models that are misaligned with the word-level granularity of scene text. We propose TiCLS, an end-to-end text spotter that explicitly incorporates external linguistic knowledge from a character-level pretrained language model. TiCLS introduces a linguistic decoder that fuses visual and linguistic features, yet can be initialized by a pretrained language model, enabling robust recognition of ambiguous or fragmented text. Experiments on ICDAR 2015 and Total-Text demonstrate that TiCLS achieves state-of-the-art performance, validating the effectiveness of PLM-guided linguistic integration for scene text spotting.
89. 【2602.04009】PromptSplit: Revealing Prompt-Level Disagreement in Generative Models
链接:https://arxiv.org/abs/2602.04009
作者:Mehdi Lotfian,Mohammad Jalali,Farzan Farnia
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Prompt-guided generative, language domains, producing realistic, textual inputs, rapidly expanded
备注:
点击查看摘要
Abstract:Prompt-guided generative AI models have rapidly expanded across vision and language domains, producing realistic and diverse outputs from textual inputs. The growing variety of such models, trained with different data and architectures, calls for principled methods to identify which types of prompts lead to distinct model behaviors. In this work, we propose PromptSplit, a kernel-based framework for detecting and analyzing prompt-dependent disagreement between generative models. For each compared model pair, PromptSplit constructs a joint prompt--output representation by forming tensor-product embeddings of the prompt and image (or text) features, and then computes the corresponding kernel covariance matrix. We utilize the eigenspace of the weighted difference between these matrices to identify the main directions of behavioral difference across prompts. To ensure scalability, we employ a random-projection approximation that reduces computational complexity to $O(nr^2 + r^3)$ for projection dimension $r$. We further provide a theoretical analysis showing that this approximation yields an eigenstructure estimate whose expected deviation from the full-dimensional result is bounded by $O(1/r^2)$. Experiments across text-to-image, text-to-text, and image-captioning settings demonstrate that PromptSplit accurately detects ground-truth behavioral differences and isolates the prompts responsible, offering an interpretable tool for detecting where generative models disagree.
90. 【2602.03983】Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement
链接:https://arxiv.org/abs/2602.03983
作者:Weikang Qiu,Tinglin Huang,Aosong Feng,Rex Ying
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:generalist robotic control, robotic control, recently emerged, promising paradigm, paradigm for generalist
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs. Experimental results show that our approach outperforms baselines on this benchmark by 39.8% absolute improvement in success rate, and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
91. 【2602.03973】VLS: Steering Pretrained Robot Policies via Vision-Language Models
链接:https://arxiv.org/abs/2602.03973
作者:Shuo Liu,Ishneet Sukhvinder Singh,Yiqing Xu,Jiafei Duan,Ranjay Krishna
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:shifted support surface, amid mild clutter, support surface, mild clutter, shifted support
备注: 11 Pages, Project page: [this https URL](https://vision-language-steering.github.io/webpage/)
点击查看摘要
Abstract:Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: this https URL
92. 【2602.03951】Representation Geometry as a Diagnostic for Out-of-Distribution Robustness
链接:https://arxiv.org/abs/2602.03951
作者:Ali Zia,Farid Hazratian
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG); General Topology (math.GN)
关键词:Robust generalization, shift remains difficult, target-domain labels, similar in-distribution accuracy, remains difficult
备注:
点击查看摘要
Abstract:Robust generalization under distribution shift remains difficult to monitor and optimize in the absence of target-domain labels, as models with similar in-distribution accuracy can exhibit markedly different out-of-distribution (OOD) performance. While prior work has focused on training-time regularization and low-order representation statistics, little is known about whether the geometric structure of learned embeddings provides reliable post-hoc signals of robustness. We propose a geometry-based diagnostic framework that constructs class-conditional mutual k-nearest-neighbor graphs from in-distribution embeddings and extracts two complementary invariants: a global spectral complexity proxy based on the reduced log-determinant of the normalized Laplacian, and a local smoothness measure based on Ollivier--Ricci curvature. Across multiple architectures, training regimes, and corruption benchmarks, we find that lower spectral complexity and higher mean curvature consistently predict stronger OOD accuracy across checkpoints. Controlled perturbations and topological analyses further show that these signals reflect meaningful representation structure rather than superficial embedding statistics. Our results demonstrate that representation geometry enables interpretable, label-free robustness diagnosis and supports reliable unsupervised checkpoint selection under distribution shift.
93. 【2602.03918】Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers
链接:https://arxiv.org/abs/2602.03918
作者:Peihao Xiang,Kaida Wu,Ou Bai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dominant pretraining paradigm, size poses significant, poses significant challenges, Masked self-supervised vision, self-supervised vision transformers
备注:
点击查看摘要
Abstract:Masked self-supervised vision transformers have become a dominant pretraining paradigm, yet their substantial model size poses significant challenges for resource-constrained deployment and efficient transfer learning. A fundamental question remains: are all transformer blocks equally important for downstream performance? In this paper, we show that block importance in masked self-supervised vision transformers can be accurately estimated without access to any data. Our key finding is that the information entropy of pretrained block weights strongly correlates with oracle sensitivity obtained via iterative block removal and finetuning. This observation enables Gardener, a data-free, one-shot, block-level pruning principle that identifies redundant blocks through simple information-theoretic measurements. We evaluate Gardener on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. Despite its negligible computational overhead, Gardener consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning. Remarkably, even after pruning up to 91.7\% of blocks, the pruned model retains competitive transfer performance. Our results reveal substantial block-level redundancy in masked self-supervised vision transformers and demonstrate that information-theoretic analysis offers a principled and efficient pathway for model compression and resource-efficient transfer learning.
94. 【2602.03916】SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
链接:https://arxiv.org/abs/2602.03916
作者:Azmine Toushik Wasi,Wahid Faisal,Abdur Rahman,Mahfuz Ahmed Anik,Munem Shahriar,Mohsin Mahmud Topu,Sadia Tasnim Meem,Rahatun Nesa Priti,Sabrina Afroz Mitu,Md. Iqramul Hoque,Shahriyar Zaman Ridoy,Mohammed Eunus Ali,Majd Hawasly,Mohammad Raza,Md Rizwan Parvez
类目:Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Spatial reasoning, contemporary vision-language models, Spatial, fundamental aspect, contemporary vision-language
备注: Accepted to ICLR 2026. 92 Pages. 42 Figures and 29 Tables
点击查看摘要
Abstract:Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth Occlusion, Orientation, Size Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: this https URL.
95. 【2602.03915】Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science
链接:https://arxiv.org/abs/2602.03915
作者:Levi Lingsch,Georgios Kissas,Johannes Jakubik,Siddhartha Mishra
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
关键词:modern deep learning, transforming high-dimensional data, efficiently learned, discrete representations, modern deep
备注: 57 pages, 27 figures
点击查看摘要
Abstract:Tokens are discrete representations that allow modern deep learning to scale by transforming high-dimensional data into sequences that can be efficiently learned, generated, and generalized to new tasks. These have become foundational for image and video generation and, more recently, physical simulation. As existing tokenizers are designed for the explicit requirements of realistic visual perception of images, it is necessary to ask whether these approaches are optimal for scientific images, which exhibit a large dynamic range and require token embeddings to retain physical and spectral properties. In this work, we investigate the accuracy of a suite of image tokenizers across a range of metrics designed to measure the fidelity of PDE properties in both physical and spectral space. Based on the observation that these struggle to capture both fine details and precise magnitudes, we propose Phaedra, inspired by classical shape-gain quantization and proper orthogonal decomposition. We demonstrate that Phaedra consistently improves reconstruction across a range of PDE datasets. Additionally, our results show strong out-of-distribution generalization capabilities to three tasks of increasing complexity, namely known PDEs with different conditions, unknown PDEs, and real-world Earth observation and weather data.
96. 【2602.03913】Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition
链接:https://arxiv.org/abs/2602.03913
作者:Qiuming Luo,Tao Zeng,Feng Li,Heming Liu,Rui Mao,Chang Kong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Chinese Character Recognition, Handwritten Chinese Character, Zero-shot Handwritten Chinese, Handwritten Chinese, recognize unseen characters
备注: 37 pages, 8 figures
点击查看摘要
Abstract:Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples.
97. 【2602.03908】Beyond the Vehicle: Cooperative Localization by Fusing Point Clouds for GPS-Challenged Urban Scenarios
链接:https://arxiv.org/abs/2602.03908
作者:Kuo-Yi Chao,Ralph Rasshofer,Alois Christian Knoll
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate vehicle localization, Accurate vehicle, environments where GPS, GPS signals, critical challenge
备注: 8 pages, 2 figures, Driving the Future Symposium 2025
点击查看摘要
Abstract:Accurate vehicle localization is a critical challenge in urban environments where GPS signals are often unreliable. This paper presents a cooperative multi-sensor and multi-modal localization approach to address this issue by fusing data from vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) systems. Our approach integrates cooperative data with a point cloud registration-based simultaneous localization and mapping (SLAM) algorithm. The system processes point clouds generated from diverse sensor modalities, including vehicle-mounted LiDAR and stereo cameras, as well as sensors deployed at intersections. By leveraging shared data from infrastructure, our method significantly improves localization accuracy and robustness in complex, GPS-noisy urban scenarios.
98. 【2602.03907】HY3D-Bench: Generation of 3D Assets
链接:https://arxiv.org/abs/2602.03907
作者:Team Hunyuan3D:Bowen Zhang,Chunchao Guo,Dongyuan Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jiaao Yu,Jiachen Xu,Jingwei Huang,Kunhong Li,Lifu Wang,Linus,Penghao Wang,Qingxiang Lin,Ruining Tang,Xianghui Yang,Yang Li,Yirui Guan,Yunfei Zhao,Yunhan Yang,Zeqiang Lai,Zhihao Liang,Zibo Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:field remains constrained, data processing bottlenecks, significant data processing, models have revolutionized, processing bottlenecks
备注: Authors are listed alphabetically by the first name
点击查看摘要
Abstract:While recent advances in neural representations and generative models have revolutionized 3D content creation, the field remains constrained by significant data processing bottlenecks. To address this, we introduce HY3D-Bench, an open-source ecosystem designed to establish a unified, high-quality foundation for 3D generation. Our contributions are threefold: (1) We curate a library of 250k high-fidelity 3D objects distilled from large-scale repositories, employing a rigorous pipeline to deliver training-ready artifacts, including watertight meshes and multi-view renderings; (2) We introduce structured part-level decomposition, providing the granularity essential for fine-grained perception and controllable editing; and (3) We bridge real-world distribution gaps via a scalable AIGC synthesis pipeline, contributing 125k synthetic assets to enhance diversity in long-tail categories. Validated empirically through the training of Hunyuan3D-2.1-Small, HY3D-Bench democratizes access to robust data resources, aiming to catalyze innovation across 3D perception, robotics, and digital content creation.
99. 【2602.03895】Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
链接:https://arxiv.org/abs/2602.03895
作者:Xuwei Tan,Ziyu Hu,Xueru Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Machine learning models, raising urgent concerns, Machine learning, learning models trained, social groups
备注: Accepted at ICLR 26
点击查看摘要
Abstract:Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing the effectiveness of bias mitigation methods remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines to help practitioners reduce expensive hyperparameter tuning space in achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy. (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.
100. 【2602.03894】Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study
链接:https://arxiv.org/abs/2602.03894
作者:Hugo Markoff,Stefan Hein Bengtson,Michael Ørsted
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:biodiversity monitoring efforts, Manual labeling, animal images remains, ecological research, limiting the scale
备注:
点击查看摘要
Abstract:Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms, two supervised and two unsupervised, across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at species-level, where it fails, and whether clustering within the species-level reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering methods. Unsupervised approaches achieve competitive performance (0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers requiring expert review. We further demonstrate robustness to realistic long-tailed distributions of species and show that intentional over-clustering can reliably extract intra-specific variation including age classes, sexual dimorphism, and pelage differences. We introduce an open-source benchmarking toolkit and provide recommendations for ecologists to select appropriate methods for sorting their specific taxonomic groups and data.
101. 【2602.03893】GPAIR: Gaussian-Kernel-Based Ultrafast 3D Photoacoustic Iterative Reconstruction
链接:https://arxiv.org/abs/2602.03893
作者:Yibing Wang,Shuang Li,Tingting Huang,Yu Zhang,Chulhong Kim,Seongwook Choi,Changhui Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long reconstruction times, correct reconstruction artifacts, substantially correct reconstruction, computed tomography, substantially correct
备注:
点击查看摘要
Abstract:Although the iterative reconstruction (IR) algorithm can substantially correct reconstruction artifacts in photoacoustic (PA) computed tomography (PACT), it suffers from long reconstruction times, especially for large-scale three-dimensional (3D) imaging in which IR takes hundreds of seconds to hours. The computing burden severely limits the practical applicability of IR algorithms. In this work, we proposed an ultrafast IR method for 3D PACT, called Gaussian-kernel-based Ultrafast 3D Photoacoustic Iterative Reconstruction (GPAIR), which achieves orders-of-magnitude acceleration in computing. GPAIR transforms traditional spatial grids with continuous isotropic Gaussian kernels. By deriving analytical closed-form expression for pressure waves and implementing powerful GPU-accelerated differentiable Triton operators, GPAIR demonstrates extraordinary ultrafast sub-second reconstruction speed for 3D targets containing 8.4 million voxels in animal experiments. This revolutionary ultrafast image reconstruction enables near-real-time large-scale 3D PA reconstruction, significantly advancing 3D PACT toward clinical applications.
102. 【2602.03892】Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
链接:https://arxiv.org/abs/2602.03892
作者:Jinxing Zhou,Yanghao Zhou,Yaoting Wang,Zongyan Han,Jiaqi Ma,Henghui Ding,Rao Muhammad Anwer,Hisham Cholakkal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:Language-referred audio-visual segmentation, segment target objects, Language-referred audio-visual, mask quality, aims to segment
备注:
点击查看摘要
Abstract:Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at this https URL.
103. 【2602.03890】4DPC$^2$hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping
链接:https://arxiv.org/abs/2602.03890
作者:Xindan Zhang,Weilong Yan,Yufei Shi,Xuerui Qiu,Tao He,Ying Li,Ming Li,Hehe Fan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multimodal large language, point cloud, dynamic point cloud, Point clouds provide, point cloud understanding
备注:
点击查看摘要
Abstract:Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC$^2$hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC$^2$hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC$^2$hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.
104. 【2602.03883】Explainable Computer Vision Framework for Automated Pore Detection and Criticality Assessment in Additive Manufacturing
链接:https://arxiv.org/abs/2602.03883
作者:Akshansh Mishra,Rakesh Morisetty
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
关键词:Internal porosity remains, compromising structural performance, limiting industrial adoption, Internal porosity, critical defect mode
备注: 6 figures
点击查看摘要
Abstract:Internal porosity remains a critical defect mode in additively manufactured components, compromising structural performance and limiting industrial adoption. Automated defect detection methods exist but lack interpretability, preventing engineers from understanding the physical basis of criticality predictions. This study presents an explainable computer vision framework for pore detection and criticality assessment in three-dimensional tomographic volumes. Sequential grayscale slices were reconstructed into volumetric datasets, and intensity-based thresholding with connected component analysis identified 500 individual pores. Each pore was characterized using geometric descriptors including size, aspect ratio, extent, and spatial position relative to the specimen boundary. A pore interaction network was constructed using percentile-based Euclidean distance criteria, yielding 24,950 inter-pore connections. Machine learning models predicted pore criticality scores from extracted features, and SHAP analysis quantified individual feature contributions. Results demonstrate that normalized surface distance dominates model predictions, contributing more than an order of magnitude greater importance than all other descriptors. Pore size provides minimal influence, while geometric parameters show negligible impact. The strong inverse relationship between surface proximity and criticality reveals boundary-driven failure mechanisms. This interpretable framework enables transparent defect assessment and provides actionable insights for process optimization and quality control in additive manufacturing.
105. 【2602.03882】PriorProbe: Recovering Individual-Level Priors for Personalizing Neural Networks in Facial Expression Recognition
链接:https://arxiv.org/abs/2602.03882
作者:Haijiang Yan,Nick Chater,Adam Sanborn
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Incorporating individual-level cognitive, introduce systematic biases, Markov Chain Monte, Chain Monte Carlo, Incorporating individual-level
备注:
点击查看摘要
Abstract:Incorporating individual-level cognitive priors offers an important route to personalizing neural networks, yet accurately eliciting such priors remains challenging: existing methods either fail to uniquely identify them or introduce systematic biases. Here, we introduce PriorProbe, a novel elicitation approach grounded in Markov Chain Monte Carlo with People that recovers fine-grained, individual-specific priors. Focusing on a facial expression recognition task, we apply PriorProbe to individual participants and test whether integrating the recovered priors with a state-of-the-art neural network improves its ability to predict an individual's classification on ambiguous stimuli. The PriorProbe-derived priors yield substantial performance gains, outperforming both the neural network alone and alternative sources of priors, while preserving the network's inference on ground-truth labels. Together, these results demonstrate that PriorProbe provides a general and interpretable framework for personalizing deep neural networks.
106. 【2602.03881】DiGAN: Diffusion-Guided Attention Network for Early Alzheimer's Disease Detection
链接:https://arxiv.org/abs/2602.03881
作者:Maxx Richard Rahman,Mostafa Hammouda,Wolfgang Maass
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:major challenge due, temporally irregular progression, Alzheimer disease, Early diagnosis, diagnosis of Alzheimer
备注:
点击查看摘要
Abstract:Early diagnosis of Alzheimer's disease (AD) remains a major challenge due to the subtle and temporally irregular progression of structural brain changes in the prodromal stages. Existing deep learning approaches require large longitudinal datasets and often fail to model the temporal continuity and modality irregularities inherent in real-world clinical data. To address these limitations, we propose the Diffusion-Guided Attention Network (DiGAN), which integrates latent diffusion modelling with an attention-guided convolutional network. The diffusion model synthesizes realistic longitudinal neuroimaging trajectories from limited training data, enriching temporal context and improving robustness to unevenly spaced visits. The attention-convolutional layer then captures discriminative structural--temporal patterns that distinguish cognitively normal subjects from those with mild cognitive impairment and subjective cognitive decline. Experiments on synthetic and ADNI datasets demonstrate that DiGAN outperforms existing state-of-the-art baselines, showing its potential for early-stage AD detection.
107. 【2602.03879】ruKAN: Towards More Efficient Kolmogorov-Arnold Networks Using Truncated Power Functions
链接:https://arxiv.org/abs/2602.03879
作者:Ali Bayeh,Samira Sadaoui,Malek Mouhoub
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Kolmogorov-Arnold Network, learnable activation functions, KAN, address the trade-off, adherence to Kolmogorov-Arnold
备注: 23 pages, 9 figures
点击查看摘要
Abstract:To address the trade-off between computational efficiency and adherence to Kolmogorov-Arnold Network (KAN) principles, we propose TruKAN, a new architecture based on the KAN structure and learnable activation functions. TruKAN replaces the B-spline basis in KAN with a family of truncated power functions derived from k-order spline theory. This change maintains the KAN's expressiveness while enhancing accuracy and training time. Each TruKAN layer combines a truncated power term with a polynomial term and employs either shared or individual knots. TruKAN exhibits greater interpretability than other KAN variants due to its simplified basis functions and knot configurations. By prioritizing interpretable basis functions, TruKAN aims to balance approximation efficacy with transparency. We develop the TruKAN model and integrate it into an advanced EfficientNet-V2-based framework, which is then evaluated on computer vision benchmark datasets. To ensure a fair comparison, we develop various models: MLP-, KAN-, SineKAN and TruKAN-based EfficientNet frameworks and assess their training time and accuracy across small and deep architectures. The training phase uses hybrid optimization to improve convergence stability. Additionally, we investigate layer normalization techniques for all the models and assess the impact of shared versus individual knots in TruKAN. Overall, TruKAN outperforms other KAN models in terms of accuracy, computational efficiency and memory usage on the complex vision task, demonstrating advantages beyond the limited settings explored in prior KAN studies.
108. 【2602.03878】Intellectual Property Protection for 3D Gaussian Splatting Assets: A Survey
链接:https://arxiv.org/abs/2602.03878
作者:Longjie Zhao,Ziming Hong,Jiaxin Huang,Runnan Chen,Mingming Gong,Tongliang Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
关键词:Gaussian Splatting, scene synthesis, content creation, representation for real-time, enabling applications
备注: A collection of relevant papers is summarized and will be continuously updated at \url{ [this https URL](https://github.com/tmllab/Awesome-3DGS-IP-Protection) }
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has become a mainstream representation for real-time 3D scene synthesis, enabling applications in virtual and augmented reality, robotics, and 3D content creation. Its rising commercial value and explicit parametric structure raise emerging intellectual property (IP) protection concerns, prompting a surge of research on 3DGS IP protection. However, current progress remains fragmented, lacking a unified view of the underlying mechanisms, protection paradigms, and robustness challenges. To address this gap, we present the first systematic survey on 3DGS IP protection and introduce a bottom-up framework that examines (i) underlying Gaussian-based perturbation mechanisms, (ii) passive and active protection paradigms, and (iii) robustness threats under emerging generative AI era, revealing gaps in technical foundations and robustness characterization and indicating opportunities for deeper investigation. Finally, we outline six research directions across robustness, efficiency, and protection paradigms, offering a roadmap toward reliable and trustworthy IP protection for 3DGS assets.
109. 【2602.03850】WebAccessVL: Making an Accessible Web via Violation-Conditioned VLM
链接:https://arxiv.org/abs/2602.03850
作者:Amber Yijia Zheng,Jae Joong Lee,Bedrich Benes,Raymond A. Yeh
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:address Web Content, Content Accessibility Guidelines, Web Content Accessibility, address Web, Accessibility Guidelines
备注:
点击查看摘要
Abstract:We present a vision-language model (VLM) that automatically edits website HTML to address Web Content Accessibility Guidelines 2 (WCAG2) violations. We formulate this as a supervised image-conditioned program synthesis task, where the model learns to correct HTML given the HTML and its rendering. We collected WebAccessVL, a new dataset with manually corrected accessibility violations, establishing paired training data. We then propose a violation-conditioned VLM that additionally conditions on the WCAG2 violation count to guide the correction process. Experiments demonstrate that our method effectively reduces the average number of violations from 5.34 to 0.44 per website, outperforming commercial LLM APIs (Gemini, GPT-5). A perceptual study confirms that our edited websites maintain the original visual appearance and content.
110. 【2602.04237】An Improved Boosted DC Algorithm for Nonsmooth Functions with Applications in Image Recovery
链接:https://arxiv.org/abs/2602.04237
作者:ZeYu Li,Te Qi,TieYong Zeng
类目:Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
关键词:convex functions algorithm, non-convex problems involving, difference of convex, functions algorithm, recently proposed BDCA
备注:
点击查看摘要
Abstract:We propose a new approach to perform the boosted difference of convex functions algorithm (BDCA) on non-smooth and non-convex problems involving the difference of convex (DC) functions. The recently proposed BDCA uses an extrapolation step from the point computed by the classical DC algorithm (DCA) via a line search procedure in a descent direction to get an additional decrease of the objective function and accelerate the convergence of DCA. However, when the first function in DC decomposition is non-smooth, the direction computed by BDCA can be ascent and a monotone line search cannot be performed. In this work, we proposed a monotone improved boosted difference of convex functions algorithm (IBDCA) for certain types of non-smooth DC programs, namely those that can be formulated as the difference of a possibly non-smooth function and a smooth one. We show that any cluster point of the sequence generated by IBDCA is a critical point of the problem under consideration and that the corresponding objective value is monotonically decreasing and convergent. We also present the global convergence and the convergent rate under the Kurdyka-Lojasiewicz property. The applications of IBDCA in image recovery show the effectiveness of our proposed method. The corresponding numerical experiments demonstrate that our IBDCA outperforms DCA and other state-of-the-art DC methods in both computational time and number of iterations.
111. 【2602.04032】MS-SCANet: A Multiscale Transformer-Based Architecture with Dual Attention for No-Reference Image Quality Assessment
链接:https://arxiv.org/abs/2602.04032
作者:Mayesha Maliha R. Mithila,Mylene C.Q. Farias
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Channel Attention Network, transformer-based architecture designed, image quality assessment, no-reference image quality, Attention Network
备注: Published in ICASSP 2025, 5 pages, 3 figures
点击查看摘要
Abstract:We present the Multi-Scale Spatial Channel Attention Network (MS-SCANet), a transformer-based architecture designed for no-reference image quality assessment (IQA). MS-SCANet features a dual-branch structure that processes images at multiple scales, effectively capturing both fine and coarse details, an improvement over traditional single-scale methods. By integrating tailored spatial and channel attention mechanisms, our model emphasizes essential features while minimizing computational complexity. A key component of MS-SCANet is its cross-branch attention mechanism, which enhances the integration of features across different scales, addressing limitations in previous approaches. We also introduce two new consistency loss functions, Cross-Branch Consistency Loss and Adaptive Pooling Consistency Loss, which maintain spatial integrity during feature scaling, outperforming conventional linear and bilinear techniques. Extensive evaluations on datasets like KonIQ-10k, LIVE, LIVE Challenge, and CSIQ show that MS-SCANet consistently surpasses state-of-the-art methods, offering a robust framework with stronger correlations with subjective human scores.
112. 【2602.03998】AtlasPatch: An Efficient and Scalable Tool for Whole Slide Image Preprocessing in Computational Pathology
链接:https://arxiv.org/abs/2602.03998
作者:Ahmed Alagha,Christopher Leclerc,Yousef Kotp,Omar Metwally,Calvin Moras,Peter Rentopoulos,Ghodsiyeh Rostami,Bich Ngoc Nguyen,Jumanah Baig,Abdelhakim Khellaf,Vincent Quoc-Huy Trinh,Rabeb Mizouni,Hadi Otrok,Jamal Bentahar,Mahdi S. Hosseini
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
关键词:typically comprising tissue, computational pathology workflows, AI-driven computational pathology, Whole-slide image, typically comprising
备注: Under review
点击查看摘要
Abstract:Whole-slide image (WSI) preprocessing, typically comprising tissue detection followed by patch extraction, is foundational to AI-driven computational pathology workflows. This remains a major computational bottleneck as existing tools either rely on inaccurate heuristic thresholding for tissue detection, or adopt AI-based approaches trained on limited-diversity data that operate at the patch level, incurring substantial computational complexity. We present AtlasPatch, an efficient and scalable slide preprocessing framework for accurate tissue detection and high-throughput patch extraction with minimal computational overhead. AtlasPatch's tissue detection module is trained on a heterogeneous and semi-manually annotated dataset of ~30,000 WSI thumbnails, using efficient fine-tuning of the Segment-Anything model. The tool extrapolates tissue masks from thumbnails to full-resolution slides to extract patch coordinates at user-specified magnifications, with options to stream patches directly into common image encoders for embedding or store patch images, all efficiently parallelized across CPUs and GPUs. We assess AtlasPatch across segmentation precision, computational complexity, and downstream multiple-instance learning, matching state-of-the-art performance while operating at a fraction of their computational cost. AtlasPatch is open-source and available at this https URL.
113. 【2602.03891】Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
链接:https://arxiv.org/abs/2602.03891
作者:Seohyun Joo,Yoori Oh
类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
关键词:video highlight detection, highlight detection aims, highlight detection, Audio-visual video highlight, video highlight
备注: 5 pages, 2 figures, to appear in ICASSP 2026
点击查看摘要
Abstract:Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale MrHiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
114. 【2602.03887】o What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?
链接:https://arxiv.org/abs/2602.03887
作者:Weiming Chen,Xitong Ling,Xidong Wang,Zhenyang Cai,Yijia Guo,Mingxi Fu,Ziyi Zeng,Minxi Ouyang,Jiawen Li,Yizhi Wang,Tian Guan,Benyou Wang,Yonghong He
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:offering strong transferability, Pathology foundation models, downstream clinical tasks, foundation models, offering strong
备注:
点击查看摘要
Abstract:Pathology foundation models (PFMs) have rapidly advanced and are becoming a common backbone for downstream clinical tasks, offering strong transferability across tissues and institutions. However, for dense prediction (e.g., segmentation), practical deployment still lacks a clear, reproducible understanding of how different PFMs behave across datasets and how adaptation choices affect performance and stability. We present PFM-DenseBench, a large-scale benchmark for dense pathology prediction, evaluating 17 PFMs across 18 public segmentation datasets. Under a unified protocol, we systematically assess PFMs with multiple adaptation and fine-tuning strategies, and derive insightful, practice-oriented findings on when and why different PFMs and tuning choices succeed or fail across heterogeneous datasets. We release containers, configs, and dataset cards to enable reproducible evaluation and informed PFM selection for real-world dense pathology tasks. Project Website: this https URL
115. 【2602.03870】DINO-AD: Unsupervised Anomaly Detection with Frozen DINO-V3 Features
链接:https://arxiv.org/abs/2602.03870
作者:Jiayu Huo,Jingyuan Hong,Liyun Chen
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:label-efficient diagnostic systems, identify abnormal regions, Unsupervised anomaly detection, medical images aims, Unsupervised anomaly
备注: Accepted by ISBI 2026, 4 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Unsupervised anomaly detection (AD) in medical images aims to identify abnormal regions without relying on pixel-level annotations, which is crucial for scalable and label-efficient diagnostic systems. In this paper, we propose a novel anomaly detection framework based on DINO-V3 representations, termed DINO-AD, which leverages self-supervised visual features for precise and interpretable anomaly localization. Specifically, we introduce an embedding similarity matching strategy to select a semantically aligned support image and a foreground-aware K-means clustering module to model the distribution of normal features. Anomaly maps are then computed by comparing the query features with clustered normal embeddings through cosine similarity. Experimental results on both the Brain and Liver datasets demonstrate that our method achieves superior quantitative performance compared with state-of-the-art approaches, achieving AUROC scores of up to 98.71. Qualitative results further confirm that our framework produces clearer and more accurate anomaly localization. Extensive ablation studies validate the effectiveness of each proposed component, highlighting the robustness and generalizability of our approach.




