本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新697篇论文,其中:

  • 自然语言处理101
  • 信息检索21
  • 计算机视觉129

自然语言处理

1. 【2602.06036】DFlash: Block Diffusion for Flash Speculative Decoding

链接https://arxiv.org/abs/2602.06036

作者:Jian Chen,Yesheng Liang,Zhijian Liu

类目:Computation and Language (cs.CL)

关键词:poor GPU utilization, deliver strong performance, high inference latency, GPU utilization, Autoregressive large language

备注

点击查看摘要

Abstract:Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.

2. 【2602.06025】Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

链接https://arxiv.org/abs/2602.06025

作者:Haozhen Zhang,Haodong Yue,Tao Feng,Quanyu Long,Jianzhu Bao,Bowen Jin,Weizhi Zhang,Xiao Li,Jiaxuan You,Chengwei Qin,Wenya Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Model, Large Language, single context window, discard query-critical information, existing systems rely

备注: Code is available at [this https URL](https://github.com/ViktorAxelsen/BudgetMem)

点击查看摘要

Abstract:Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present \textbf{BudgetMem}, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textsc{Low}/\textsc{Mid}/\textsc{High}). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

3. 【2602.06019】Multi-Token Prediction via Self-Distillation

链接https://arxiv.org/abs/2602.06019

作者:John Kirchenbauer,Abhimanyu Hans,Brian Bartoldson,Micah Goldblum,Ashwinee Panda,Tom Goldstein

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:require training auxiliary, Existing techniques, complex inference pipelines, training auxiliary speculator, deploying complex inference

备注: 8 pages and 5 figures in the main body

点击查看摘要

Abstract:Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $5\%$ drop in accuracy relative to single token decoding performance.

4. 【2602.06015】A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

链接https://arxiv.org/abs/2602.06015

作者:Panagiotis Kaliosis,Adithya V Ganesan,Oscar N.E. Kjell,Whitney Ringwald,Scott Feltman,Melissa A. Carr,Dimitris Samaras,Camilo Ruggero,Benjamin J. Luft,Roman Kotov,Andrew H. Schwartz

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, mental health conditions, assess mental health, self-reported PTSD severity

备注: 18 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs few shot, amount of reasoning effort, model sizes, structured subscales vs direct scalar prediction, output rescaling and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek), plateau beyond 70B parameters while closed-weight (o3-mini, gpt-5) models improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.

5. 【2602.06000】Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

链接https://arxiv.org/abs/2602.06000

作者:Ali Shendabadi,Parnia Izadirad,Mostafa Salehi,Mahmoud Bijankhan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Speech Emotion Recognition, faced limitations due, sufficiently large datasets, Speech Emotion, Emotion Recognition

备注

点击查看摘要

Abstract:Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.

6. 【2602.05992】DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

链接https://arxiv.org/abs/2602.05992

作者:Lizhuo Luo,Shenggui Li,Yonggang Wen,Tianwei Zhang

类目:Computation and Language (cs.CL)

关键词:Diffusion large language, Diffusion large, large language models, large language, promising alternative

备注

点击查看摘要

Abstract:Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at this https URL.

7. 【2602.05975】SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

链接https://arxiv.org/abs/2602.05975

作者:Tiansheng Hu,Yilun Zhao,Canyu Zhang,Arman Cohan,Chen Zhao

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:addressing complex queries, Deep research agents, Deep research, emerged as powerful, addressing complex

备注: Submission to ACL ARR 2026 January

点击查看摘要

Abstract:Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000 paper retrieval this http URL evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.

8. 【2602.05971】Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

链接https://arxiv.org/abs/2602.05971

作者:Felipe D. Toro-Hernández,Jesuino Vieira Filho,Rodrigo M. Cabral-Carvalho

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

关键词:manipulate meaning, navigate to retrieve, retrieve and manipulate, Semantic, dynamic knowledge space

备注: 10 pages, 6 figures (excluding refs/appendix). Accepted to ICLR 2026

点击查看摘要

Abstract:Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, bridging cognitive modeling with learned representation, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.

9. 【2602.05940】Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training

链接https://arxiv.org/abs/2602.05940

作者:Junxiao Liu,Zhijun Wang,Yixiao Li,Zhejian Lai,Liqian Huang,Xin Huang,Xue Han,Junlan Feng,Shujian Huang

类目:Computation and Language (cs.CL)

关键词:Long reasoning models, accuracies drop substantially, reason in English, English for non-English, Long reasoning

备注: 16 pages, 11 figures

点击查看摘要

Abstract:Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.

10. 【2602.05932】Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

链接https://arxiv.org/abs/2602.05932

作者:Léo Labat,Etienne Ollion,François Yvon

类目:Computation and Language (cs.CL)

关键词:reasoning abilities, assess knowledge, encoded in large, multilingual LLMs, value-laden MCQ responses

备注: 17 pages, 5 figures (8 pages of references and appendices)

点击查看摘要

Abstract:Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.

11. 【2602.05929】KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

链接https://arxiv.org/abs/2602.05929

作者:Jian Chen,Zhuoran Wang,Jiayu Qin,Ming Li,Meng Wang,Changyou Chen,Yin Chen,Qizhen Weng,Yirui Liu

类目:Computation and Language (cs.CL)

关键词:GPU memory bandwidth, saturate GPU memory, quickly saturate GPU, context length grows, avoid redundant computation

备注

点击查看摘要

Abstract:Large language models rely on kv-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of kv-caches and their variation across layers. We introduce KV-CoRE KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of kv-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of kv-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.

12. 【2602.05905】Codified Finite-state Machines for Role-playing

链接https://arxiv.org/abs/2602.05905

作者:Letian Peng,Yupeng Hou,Kun Zhou,Jingbo Shang

类目:Computation and Language (cs.CL)

关键词:Modeling latent character, Modeling latent, large language models, engaging role-playing, finite-state machines

备注

点击查看摘要

Abstract:Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.

13. 【2602.05897】Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

链接https://arxiv.org/abs/2602.05897

作者:Shuo Nie,Hexuan Deng,Chao Wang,Ruiyu Fang,Xuebo Liu,Shuangyong Song,Yu Li,Min Zhang,Xuelong Li

类目:Computation and Language (cs.CL)

关键词:large language models, crucial for enabling, resource-constrained settings, large language, small reasoning models

备注

点击查看摘要

Abstract:As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at this https URL.

14. 【2602.05890】DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

链接https://arxiv.org/abs/2602.05890

作者:Dingwei Zhu,Zhiheng Xi,Shihan Dou,Jiahan Li,Chenhao Huang,Junjie Ye,Sixian Li,Mingxu Chai,Yuhui Wang,Yajie Yang,Ming Zhang,Jiazheng Zhang,Shichun Liu,Caishuang Huang,Yunke Zhang,Yuran Wang,Tao Gui,Xipeng Qiu,Qi Zhang,Xuanjing Huang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:real-world environments remains, environments remains challenging, remains challenging due, LLM post-training, systems in real-world

备注

点击查看摘要

Abstract:Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.

15. 【2602.05885】Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

链接https://arxiv.org/abs/2602.05885

作者:Wei Liu,Jiawei Xu,Yingru Li,Longtao Zheng,Tianjian Li,Qian Liu,Junxian He

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:High-quality kernel, scalable AI systems, advance AI development, critical for scalable, enabling LLMs

备注

点击查看摘要

Abstract:High-quality kernel is critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data, a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward hacking check, data collection from multi-turn interactions and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, this http URL-14B, reaches performance competitive with Claude-4.5-Sonnet in Kernelbench. Finally, we study sequential test-time scaling for this http URL-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including environment, training code, models, and dataset, are included in this https URL.

16. 【2602.05879】EuroLLM-22B: Technical Report

链接https://arxiv.org/abs/2602.05879

作者:Miguel Moura Ramos,Duarte M. Alves,Hippolyte Gisserot-Boukhlef,João Alves,Pedro Henrique Martins,Patrick Fernandes,José Pombal,Nuno M. Guerreiro,Ricardo Rei,Nicolas Boizard,Amin Farajian,Mateusz Klimaszewski,José G. C. de Souza,Barry Haddow,François Yvon,Pierre Colombo,Alexandra Birch,André F. T. Martins

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:official European Union, European Union languages, European Union, Union languages, official European

备注

点击查看摘要

Abstract:This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.

17. 【2602.05874】xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection

链接https://arxiv.org/abs/2602.05874

作者:Adrián Girón,Pablo Miralles,Javier Huertas-Tato,Sergio D'Antonio,David Camacho

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:composite concept defined, Hate speech detection, multiple interacting factors, Hate speech, platform policies

备注

点击查看摘要

Abstract:Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate it across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, we consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis. Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.05874 [cs.CL]

(or
arXiv:2602.05874v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.05874

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Adrián Girón [view email] [v1]
Thu, 5 Feb 2026 16:51:56 UTC (196 KB)

18. 【2602.05863】Constrained Group Relative Policy Optimization

链接https://arxiv.org/abs/2602.05863

作者:Roger Girgis,Rodrigue de Schaetzen,Luke Rowe,Azalée Robitaille,Christopher Pal,Liam Paull

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Robotics (cs.RO)

关键词:Group Relative Policy, critic-free policy learning, Group Relative, constrained policy optimization, constraints remains underexplored

备注: 16 pages, 6 figures

点击查看摘要

Abstract:While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.

19. 【2602.05859】DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

链接https://arxiv.org/abs/2602.05859

作者:Xu Wang,Bingqing Jiang,Yu Wan,Baosong Yang,Lingpeng Kong,Difan Zou

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Sparse autoencoders, large language models, autoregressive large language, mechanistic interpretability, mechanistic interpretability tools

备注: 23 pages

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to the autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer certain new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order; and the SAE features are stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows a great potential of applying SAEs to DLM-related tasks and algorithms.

20. 【2602.05853】RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

链接https://arxiv.org/abs/2602.05853

作者:Siran Liu,Guoxia Wang,Sa Wang,Jinle Zeng,HaoYang Xie,Siyu Lou,JiaBin Yang,DianHai Yu,Haifeng Wang,Chao Yang

类目:Computation and Language (cs.CL)

关键词:models processing long, attention mechanisms poses, dynamic sparse attention, processing long contexts, large language models

备注

点击查看摘要

Abstract:The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head \underline{r}ound-\underline{r}obin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99\% of full attention performance while computing only half of the attention blocks, achieving 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.

21. 【2602.05848】DARWIN: Dynamic Agentically Rewriting Self-Improving Network

链接https://arxiv.org/abs/2602.05848

作者:Henry Jiang

类目:Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:independent GPT agents, GPT agents, unique training code, evolutionary GPT training, training code

备注: 6 pages, 3 figures, 2 tables

点击查看摘要

Abstract:DARWIN is an evolutionary GPT model, utilizing a genetic-algorithm like optimization structure with several independent GPT agents being trained individually using unique training code. Each iteration, the GPT models are prompted to modify the training code of one another in an attempt to improve their performance in a mutation-like manner, and the best GPT agents are then benchmarked and selected for the next iteration by genetic algorithm. For demonstration purposes and due to budget and time constraints, OpenAI API is used to prompt training code improvements and the nanoGPT framework is used as the training code. DARWIN also utilizes persistent JSON-based memory files to track previous reasoning and changes to code to correlate with improvement to model performance. and a bidirectional interface for HITL intervention allowing the model to request upgrades such as additional datasets, training scripts, and restructuring of file hierarchies. In experiments, DARWIN achieved a 1.26 percent improvement in model FLOPS utilization (MFU) and a 2.07 percent improvement to perplexity in 5 iterations of training over baseline configurations, demonstrating promising capabilities as a foundation for scaling evolutionary GPT training.

22. 【2602.05843】OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

链接https://arxiv.org/abs/2602.05843

作者:Fangzhi Xu,Hang Yan,Qiushi Sun,Jinyang Wu,Zixian Huang,Muye Huang,Jingyang Gong,Zichen Ding,Kanzhi Cheng,Yian Wang,Xinyu Che,Zeyi Sun,Jian Zhang,Zhangyue Yin,Haoran Luo,Xuanjing Huang,Ben Kao,Jun Liu,Qika Lin

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, advancement of Large, Language Models, rapid advancement

备注: 34 pages

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at this https URL

23. 【2602.05842】Reinforcement World Model Learning for LLM-based Agents

链接https://arxiv.org/abs/2602.05842

作者:Xiao Yu,Baolin Peng,Ruize Xu,Yelong Shen,Pengcheng He,Suman Nath,Nikhil Singh,Jiangfeng Gao,Zhou Yu

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, achieved strong performance, language-centric tasks, achieved strong

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench respectively, while matching the performance of expert-data training.

24. 【2602.05794】FiMI: A Domain-Specific Language Model for Indian Finance Ecosystem

链接https://arxiv.org/abs/2602.05794

作者:Aboli Kathar,Aman Kumar,Anusha Kamath,Araveeti Srujan,Ashish Sharma,Chandra Bhushan,Dilip Asbe,Divya Sorate,Duddu Prasanth Kumar,Evan Acharya,Harsh Sharma,Hrithik Kadam,Kanishk Singla,Keyur Doshi,Kiran Praveen,Kolisetty Krishna SK,Krishanu Adhikary,Lokesh MPT,Mayurdeep Sonowal,Nadeem Shaikh,Navya Prakash,Nimit Kothari,Nitin Kukreja,Prashant Devadiga,Rakesh Paul,Ratanjeet Pratap Chauhan,Raunak Kalani,Raviraj Joshi,Shamanth MH,Shantanu Pandey,Shubham Soni,Siddharth Dixit,Smriti Jopat,Sunil Patel,Suraj Singh,Suvradip Paul,Tulasi Pilla,Utkarsh Vaidya,Vineeth Nambiar,Vishal Kanvaty,Yatharth Dedhia

类目:Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Indian digital payment, digital payment systems, Mistral Small, developed for Indian, Indian digital

备注

点击查看摘要

Abstract:We present FiMI (Finance Model for India), a domain-specialized financial language model developed for Indian digital payment systems. We develop two model variants: FiMI Base and FiMI Instruct. FiMI adapts the Mistral Small 24B architecture through a multi-stage training pipeline, beginning with continuous pre-training on 68 Billion tokens of curated financial, multilingual (English, Hindi, Hinglish), and synthetic data. This is followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations that model real-world workflows, such as transaction disputes and mandate lifecycle management. Evaluations reveal that FiMI Base achieves a 20% improvement over the Mistral Small 24B Base model on finance reasoning benchmark, while FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool-calling. Moreover, FiMI achieves these significant domain gains while maintaining comparable performance to models of similar size on general benchmarks.

25. 【2602.05787】Bagging-Based Model Merging for Robust General Text Embeddings

链接https://arxiv.org/abs/2602.05787

作者:Hengran Zhang,Keping Bi,Jiafeng Guo,Jiaming Zhang,Wenbo Yang,Daiting Shi,Xueqi Cheng

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:information retrieval applications, General-purpose text embedding, large-scale multi-task corpora, range of NLP, NLP and information

备注: 12 pages, 4 figures

点击查看摘要

Abstract:General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (\modelname), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, \modelname naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that \modelname consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.

26. 【2602.05769】Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors

链接https://arxiv.org/abs/2602.05769

作者:Adnan Al Ali,Jindřich Helcl,Jindřich Libovický

类目:Computation and Language (cs.CL)

关键词:LLM-based assistants, release of ChatGPT, widely popularised, non-native speakers, Abstract

备注: This paper was accepted to EACL 2026 Student Research Workshop

点击查看摘要

Abstract:LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.

27. 【2602.05758】LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards

链接https://arxiv.org/abs/2602.05758

作者:Bowen Ping,Zijun Chen,Yiyao Yu,Tingfeng Hui,Junchi Yan,Baobao Chang

类目:Computation and Language (cs.CL)

关键词:Reinforcement Learning, Learning has emerged, driver for LLM, key driver, LLM reasoning

备注

点击查看摘要

Abstract:Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios--such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide the complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.

28. 【2602.05728】CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering

链接https://arxiv.org/abs/2602.05728

作者:Hao Yang,Zhiyu Yang,Xupeng Zhang,Wei Wei,Yunjie Zhang,Lin Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:knowledge-intensive question answering, Retrieval-augmented generation, question answering, key paradigm, paradigm for knowledge-intensive

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.05728 [cs.CL]

(or
arXiv:2602.05728v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.05728

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
29. 【2602.05711】OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

链接https://arxiv.org/abs/2602.05711

作者:Jingze Shi,Zhangyang Peng,Yizhang Zhu,Yifan Wu,Guang Liu,Yuyu Luo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:improve parameter efficiency, architectures are evolving, evolving towards finer, finer granularity, hardware execution efficiency

备注

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at this https URL.

30. 【2602.05710】Ethology of Latent Spaces

链接https://arxiv.org/abs/2602.05710

作者:Philippe Boisnard

类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:study challenges, challenges the presumed, presumed neutrality, adopting an ethological, ethological perspective

备注: 23. pages, 14 figures, presented Hyperheritage International Symposium 9 ( [this https URL](https://paragraphe.univ-paris8.fr/IMG/pdf/programme_colloque_his9_campuscondorcet_v3.pdf) ) and accepted for publication in double-blind peer review in French in 2026-2027

点击查看摘要

Abstract:This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices. Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while remaining apolitical in OpenAI CLIP. On an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points. We introduce three operational concepts: computational latent politicization, describing the emergence of political categories without intentional encoding; emergent bias, irreducible to statistical or normative bias and detectable only through contrastive analysis; and three algorithmic scopic regimes: entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), which structure distinct modes of visibility. Drawing on Foucault's notion of the archive, Jameson's ideologeme, and Simondon's theory of individuation, we argue that training datasets function as quasi-archives whose discursive formations crystallize within latent space. This work contributes to a critical reassessment of the conditions under which VLMs are applied to digital art history and calls for methodologies that integrate learning architectures into any delegation of cultural interpretation to algorithmic agents.

Comments:
23. pages, 14 figures, presented Hyperheritage International Symposium 9 ( this https URL ) and accepted for publication in double-blind peer review in French in 2026-2027

Subjects:

Computers and Society (cs.CY); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2602.05710 [cs.CY]

(or
arXiv:2602.05710v1 [cs.CY] for this version)

https://doi.org/10.48550/arXiv.2602.05710

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
31. 【2602.05708】Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration

链接https://arxiv.org/abs/2602.05708

作者:Chuangtao Ma,Zeyu Zhang,Arijit Khan,Sebastian Schelter,Paul Groth

类目:Databases (cs.DB); Computation and Language (cs.CL)

关键词:enhances LLM reasoning, pipelines incur substantial, existing RAG pipelines, RAG pipelines incur, enhances LLM

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances LLM reasoning in knowledge-intensive tasks, but existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. To address this limitation, we introduce CE-RAG4EM, a cost-efficient RAG architecture that reduces computation through blocking-based batch retrieval and generation. We also present a unified framework for analyzing and evaluating RAG systems for entity matching, focusing on blocking-aware optimizations and retrieval granularity. Extensive experiments suggest that CE-RAG4EM can achieve comparable or improved matching quality while substantially reducing end-to-end runtime relative to strong baselines. Our analysis further reveals that key configuration parameters introduce an inherent trade-off between performance and overhead, offering practical guidance for designing efficient and scalable RAG systems for entity matching and data integration.

32. 【2602.05694】Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation

链接https://arxiv.org/abs/2602.05694

作者:Shuting Jiang,Ran Song,Yuxin Huang,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu

类目:Computation and Language (cs.CL)

关键词:Multi-domain machine translation, unified model capable, Multi-domain machine, aims to build, build a unified

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation still remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains evidence that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.

33. 【2602.05692】MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations

链接https://arxiv.org/abs/2602.05692

作者:Congbo Ma,Yichun Zhang,Yousef Al-Jazzazi,Ahamed Foisal,Laasya Sharma,Yousra Sadqi,Khaled Saleh,Jihad Mallat,Farah E. Shamout

类目:Computation and Language (cs.CL)

关键词:incorrect treatment suggestion, Inaccuracies in existing, generated clinical text, adverse consequences, treatment suggestion

备注

点击查看摘要

Abstract:Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if it is a misdiagnosis or incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly-available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: this https URL.

34. 【2602.05648】Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

链接https://arxiv.org/abs/2602.05648

作者:Giuseppe Samo,Paola Merlo

类目:Computation and Language (cs.CL)

关键词:represent complex verb, complex verb paradigms, Modern Hebrew, tokenization strategies shape, Blackbird Language Matrices

备注: 13 pages, 7 figures, to appear as proceedings of the SIGTURK 2026 Workshop

点击查看摘要

Abstract:We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.

35. 【2602.05636】Generative Ontology: When Structured Knowledge Learns to Create

链接https://arxiv.org/abs/2602.05636

作者:Benny Cheung

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Traditional ontologies excel, Traditional ontologies, describing domain structure, Generative Ontology, ontologies excel

备注: 15 pages, 6 figures, 6 tables. Code available at [this https URL](https://github.com/bennycheung/GameGrammarCLI)

点击查看摘要

Abstract:Traditional ontologies excel at describing domain structure but cannot generate novel artifacts. Large language models generate fluently but produce outputs that lack structural validity, hallucinating mechanisms without components, goals without end conditions. We introduce Generative Ontology, a framework that synthesizes these complementary strengths: ontology provides the grammar; the LLM provides the creativity. Generative Ontology encodes domain knowledge as executable Pydantic schemas that constrain LLM generation via DSPy signatures. A multi-agent pipeline assigns specialized roles to different ontology domains: a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, a Balance Critic identifies exploits. Each agent carrying a professional "anxiety" that prevents shallow, agreeable outputs. Retrieval-augmented generation grounds novel designs in precedents from existing exemplars, while iterative validation ensures coherence between mechanisms and components. We demonstrate the framework through GameGrammar, a system for generating complete tabletop game designs. Given a thematic prompt ("bioluminescent fungi competing in a cave ecosystem"), the pipeline produces structurally complete, playable game specifications with mechanisms, components, victory conditions, and setup instructions. These outputs satisfy ontological constraints while remaining genuinely creative. The pattern generalizes beyond games. Any domain with expert vocabulary, validity constraints, and accumulated exemplars (music composition, software architecture, culinary arts) is a candidate for Generative Ontology. We argue that constraints do not limit creativity but enable it: just as grammar makes poetry possible, ontology makes structured generation possible.

Comments:
15 pages, 6 figures, 6 tables. Code available at this https URL

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2602.05636 [cs.AI]

(or
arXiv:2602.05636v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2602.05636

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Benny Cheung [view email] [v1]
Thu, 5 Feb 2026 13:14:20 UTC (8,977 KB)

36. 【2602.05633】CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models

链接https://arxiv.org/abs/2602.05633

作者:Rui Jia,Ruiyi Lan,Fengrui Liu,Zhongxiang Dai,Bo Jiang,Jing Shao,Jingyuan Chen,Guandong Xu,Fei Wu,Min Zhang

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, learning in education, advanced the development, student attributes

备注

点击查看摘要

Abstract:Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students cognitive and psychological, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model ability to detect risks; Emotional Empathy, evaluating the model capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.

37. 【2602.05630】Rewards as Labels: Revisiting RLVR from a Classification Perspective

链接https://arxiv.org/abs/2602.05630

作者:Zepeng Zhai,Meilin Chen,Jiaxuan Zhao,Junlang Qian,Lei Shen,Yuan Lu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, explicit rule-based supervision, providing explicit rule-based, capabilities of Large

备注: 12 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains further scale to 7B model, REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy, REAL remains stable and exceeds DAPO by 4.5% on average.

38. 【2602.05628】AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care

链接https://arxiv.org/abs/2602.05628

作者:Alastair Howcroft,Amber Bennett-Weston,Ahmad Khan,Joseff Griffiths,Simon Gay,Jeremy Howick

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:including reduced pain, Toggle, Code Toggle Papers, Toggle Hugging Face, Code

备注: Open Access Invited Review. Systematic review and meta analysis of 15 studies 2023-2024. Published 20 October 2025

点击查看摘要

Abstract:Background: Empathy is widely recognized for improving patient outcomes, including reduced pain and anxiety and improved satisfaction, and its absence can cause harm. Meanwhile, use of artificial intelligence (AI)-based chatbots in healthcare is rapidly expanding, with one in five general practitioners using generative AI to assist with tasks such as writing letters. Some studies suggest AI chatbots can outperform human healthcare professionals (HCPs) in empathy, though findings are mixed and lack synthesis. Sources of data: We searched multiple databases for studies comparing AI chatbots using large language models with human HCPs on empathy measures. We assessed risk of bias with ROBINS-I and synthesized findings using random-effects meta-analysis where feasible, whilst avoiding double counting. Areas of agreement: We identified 15 studies (2023-2024). Thirteen studies reported statistically significantly higher empathy ratings for AI, with only two studies situated in dermatology favouring human responses. Of the 15 studies, 13 provided extractable data and were suitable for pooling. Meta-analysis of those 13 studies, all utilising ChatGPT-3.5/4, showed a standardized mean difference of 0.87 (95% CI, 0.54-1.20) favouring AI (P .00001), roughly equivalent to a two-point increase on a 10-point scale. Areas of controversy: Studies relied on text-based assessments that overlook non-verbal cues and evaluated empathy through proxy raters. Growing points: Our findings indicate that, in text-only scenarios, AI chatbots are frequently perceived as more empathic than human HCPs. Areas timely for developing research: Future research should validate these findings with direct patient evaluations and assess whether emerging voice-enabled AI systems can deliver similar empathic advantages.

Comments:
Open Access Invited Review. Systematic review and meta analysis of 15 studies 2023-2024. Published 20 October 2025

Subjects:

Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

MSC classes:
62-07, 62P15

ACMclasses:
I.2.7; J.3; H.1.2; H.5.2

Cite as:
arXiv:2602.05628 [cs.HC]

(or
arXiv:2602.05628v1 [cs.HC] for this version)

https://doi.org/10.48550/arXiv.2602.05628

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Journalreference:
British Medical Bulletin, Volume 156, Issue 1, December 2025, ldaf017

Related DOI:

https://doi.org/10.1093/bmb/ldaf017

Focus to learn more

            DOI(s) linking to related resources

Submission history From: Alastair Howcroft [view email] [v1]
Thu, 5 Feb 2026 13:09:19 UTC (585 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care, by Alastair Howcroft and 5 other authorsView PDF

view license

Current browse context: cs.HC

prev

|
next

new
|
recent
| 2026-02

Change to browse by:

cs
cs.AI
cs.CL

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

39. 【2602.05599】BhashaSetu: Cross-Lingual Knowledge Transfer from High-Resource to Extreme Low-Resource Languages

链接https://arxiv.org/abs/2602.05599

作者:Subhadip Maji,Arnab Bhattacharya

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:developing effective systems, performances typically lagging, high-resource counterparts due, natural language processing, low-resource languages remains

备注: Accepted as a long paper at IJCNLP-AACL Main Conference

点击查看摘要

Abstract:Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performances typically lagging far behind high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation) for cross-lingual knowledge transfer along with two adopted baselines (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) across sentiment classification and NER tasks respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.

40. 【2602.05550】ArkTS-CodeSearch: A Open-Source ArkTS Dataset for Code Retrieval

链接https://arxiv.org/abs/2602.05550

作者:Yulong He,Artem Ermakov,Sergey Kovalchuk,Artem Aliev,Dmitry Shalymov

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:core programming language, ArkTS, OpenHarmony ecosystem, core programming, intelligence is hindered

备注

点击查看摘要

Abstract:ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate all existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. Both the dataset and our fine-tuned model will be released publicly and are available at this https URL and this https URL the first systematic benchmark for ArkTS code retrieval.

41. 【2602.05547】Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

链接https://arxiv.org/abs/2602.05547

作者:Shyam Sundhar Ramesh,Xiaotong Ji,Matthieu Zimmer,Sangwoong Yoon,Zhiyong Wang,Haitham Bou Ammar,Aurelien Lucchi,Ilija Bogunovic

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:improve large language, large language models, individual reasoning tasks, RL-based post-training, improve large

备注: Preprint

点击查看摘要

Abstract:RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.

42. 【2602.05539】Steering Large Reasoning Models towards Concise Reasoning via Flow Matching

链接https://arxiv.org/abs/2602.05539

作者:Yawei Li,Benjamin Bergner,Yinghan Zhao,Vihang Prakash Patil,Bei Chen,Cheng Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Reasoning Models, overly verbose outputs, Large Reasoning, excel at complex, hampered by overly

备注: This paper has been accepted to Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Large Reasoning Models (LRMs) excel at complex reasoning tasks, but their efficiency is often hampered by overly verbose outputs. Prior steering methods attempt to address this issue by applying a single, global vector to hidden representations -- an approach grounded in the restrictive linear representation hypothesis. In this work, we introduce FlowSteer, a nonlinear steering method that goes beyond uniform linear shifts by learning a complete transformation between the distributions associated with verbose and concise reasoning. This transformation is learned via Flow Matching as a velocity field, enabling precise, input-dependent control over the model's reasoning process. By aligning steered representations with the distribution of concise-reasoning activations, FlowSteer yields more compact reasoning than the linear shifts. Across diverse reasoning benchmarks, FlowSteer demonstrates strong task performance and token efficiency compared to leading inference-time baselines. Our work demonstrates that modeling the full distributional transport with generative techniques offers a more effective and principled foundation for controlling LRMs.

43. 【2602.05536】When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

链接https://arxiv.org/abs/2602.05536

作者:Yayuan Li,Ze Peng,Jian Zhang,Jintao Guo,Yue Duan,Yinghuan Shi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:combines multiple fine-tuned, multiple fine-tuned models, merging combines multiple, providing a lightweight, alternative to retraining

备注

点击查看摘要

Abstract:Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: this https URL.

44. 【2602.05515】A Unified Multimodal Framework for Dataset Construction and Model-Based Diagnosis of Ameloblastoma

链接https://arxiv.org/abs/2602.05515

作者:Ajo Babu George,Anna Mariam John,Athul Anoop,Balu Bhasuran

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:maxillofacial pathology require, Artificial intelligence, pathology require structured, enabled diagnostics, diagnostics in maxillofacial

备注

点击查看摘要

Abstract:Artificial intelligence (AI)-enabled diagnostics in maxillofacial pathology require structured, high-quality multimodal datasets. However, existing resources provide limited ameloblastoma coverage and lack the format consistency needed for direct model training. We present a newly curated multimodal dataset specifically focused on ameloblastoma, integrating annotated radiological, histopathological, and intraoral clinical images with structured data derived from case reports. Natural language processing techniques were employed to extract clinically relevant features from textual reports, while image data underwent domain specific preprocessing and augmentation. Using this dataset, a multimodal deep learning model was developed to classify ameloblastoma variants, assess behavioral patterns such as recurrence risk, and support surgical planning. The model is designed to accept clinical inputs such as presenting complaint, age, and gender during deployment to enhance personalized inference. Quantitative evaluation demonstrated substantial improvements; variant classification accuracy increased from 46.2 percent to 65.9 percent, and abnormal tissue detection F1-score improved from 43.0 percent to 90.3 percent. Benchmarked against resources like MultiCaRe, this work advances patient-specific decision support by providing both a robust dataset and an adaptable multimodal AI framework.

45. 【2602.05512】A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering

链接https://arxiv.org/abs/2602.05512

作者:Larissa Pusch,Alexandre Courtiol,Tim Conrad

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, knowledge-intensive domains due, outdated information, limited explainability, Large Language

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require a knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.

46. 【2602.05495】ransport and Merge: Cross-Architecture Merging for Large Language Models

链接https://arxiv.org/abs/2602.05495

作者:Chenhang Cui,Binyun Yang,Fei Shen,Yuxin Chen,Jingnan Zheng,Xiang Wang,An Zhang,Tat-Seng Chua

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:achieve strong capabilities, real-world deployments rely, training data, scaling model capacity, achieve strong

备注

点击查看摘要

Abstract:Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.

47. 【2602.05493】LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation

链接https://arxiv.org/abs/2602.05493

作者:Bingru Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:Social Sciences, Humanities and Social, Data annotation remains, Large Language Models, complex semantic tasks

备注

点击查看摘要

Abstract:Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and $F_1$ score) against human gold standards. The application and codes are released on this https URL.

48. 【2602.05471】Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision

链接https://arxiv.org/abs/2602.05471

作者:Md. Mithun Hossaina,Mashary N. Alrasheedy,Nirban Bhowmick,Shamim Forhad,Md. Shakil Hossain,Sudipto Chaki,Md Shafiqul Islam

类目:Computation and Language (cs.CL)

关键词:Contemporary knowledge-based systems, support intelligent decision-making, knowledge-based systems increasingly, face major challenges, major challenges due

备注

点击查看摘要

Abstract:Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.

49. 【2602.05467】MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

链接https://arxiv.org/abs/2602.05467

作者:Dekang Qi,Shuang Zeng,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Mu Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)

关键词:Visual Language Navigation, Visual Language, Language Navigation, fundamental capabilities, capabilities for embodied

备注: 9 pages, 2 figures, 5 tables, conference

点击查看摘要

Abstract:Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.

50. 【2602.05447】Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale

链接https://arxiv.org/abs/2602.05447

作者:Damon McMillan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:increasingly operate external, Large Language Model, Token-Oriented Object Notation, Large Language, agents increasingly operate

备注: 8 pages, 7 figures, 10 tables, 26 references

点击查看摘要

Abstract:Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.

Comments:
8 pages, 7 figures, 10 tables, 26 references

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.05447 [cs.CL]

(or
arXiv:2602.05447v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.05447

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
51. 【2602.05444】Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

链接https://arxiv.org/abs/2602.05444

作者:Yao Zhou,Zeen Song,Wenwen Qiang,Fengge Wu,Shuyi Zhou,Changwen Zheng,Hui Xiong

类目:Computation and Language (cs.CL)

关键词:Large Language Models, latent internal states, Large Language, model inherent capabilities, Safety alignment mechanisms

备注

点击查看摘要

Abstract:Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the \textbf{C}ausal \textbf{F}ront-Door \textbf{A}djustment \textbf{A}ttack ({\textbf{CFA}}$^2$) to jailbreak LLM, which is a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that {CFA}$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.

52. 【2602.05437】Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models

链接https://arxiv.org/abs/2602.05437

作者:Basel Mousi,Fahim Dalvi,Shammur Chowdhury,Firoj Alam,Nadir Durrani

类目:Computation and Language (cs.CL)

关键词:visually incorrect interpretations, Vision-language models, accepting culturally plausible, incorrect interpretations, plausible but visually

备注

点击查看摘要

Abstract:Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.

53. 【2602.05419】Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation

链接https://arxiv.org/abs/2602.05419

作者:Takumi Goto,Yusuke Sakai,Taro Watanabe

类目:Computation and Language (cs.CL)

关键词:grammatical error correction, Automatic evaluation, error correction, grammatical error, crucial for selecting

备注: Accepted to TACL. This is a pre-MIT Press publication version

点击查看摘要

Abstract:Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Currently, reference-based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT-ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from this https URL.

54. 【2602.05413】SciDef: Automating Definition Extraction from Academic Literature with Large Language Models

链接https://arxiv.org/abs/2602.05413

作者:Filip Kučera,Christoph Mandl,Isao Echizen,Radu Timofte,Timo Spinde

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:publication numbers, significant increase, increase in publication, gathering definitions relevant, Definitions

备注: Under Review - Submitted to SIGIR 2026 Resources Track; 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Definitions are the foundation for any scientific work, but with a significant increase in publication numbers, gathering definitions relevant to any keyword has become challenging. We therefore introduce SciDef, an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra DefSim, novel datasets of human-extracted definitions and definition-pairs' similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi-step and DSPy-optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI-based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test-set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over-generate them. Code datasets are available at this https URL.

Comments:
Under Review - Submitted to SIGIR 2026 Resources Track; 8 pages, 6 figures, 4 tables

Subjects:

Information Retrieval (cs.IR); Computation and Language (cs.CL)

Cite as:
arXiv:2602.05413 [cs.IR]

(or
arXiv:2602.05413v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.05413

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
55. 【2602.05407】H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration

链接https://arxiv.org/abs/2602.05407

作者:Jun-Min Lee,Meong Hi Son,Edward Choi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:driving growing interest, administration departments handle, Hospital administration departments, requests per day, driving growing

备注

点击查看摘要

Abstract:Hospital administration departments handle a wide range of operational tasks and, in large hospitals, process over 10,000 requests per day, driving growing interest in LLM-based automation. However, prior work has focused primarily on patient--physician interactions or isolated administrative subtasks, failing to capture the complexity of real administrative workflows. To address this gap, we propose H-AdminSim, a comprehensive end-to-end simulation framework that combines realistic data generation with multi-agent-based simulation of hospital administrative workflows. These tasks are quantitatively evaluated using detailed rubrics, enabling systematic comparison of LLMs. Through FHIR integration, H-AdminSim provides a unified and interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing the feasibility and performance of LLM-driven administrative automation.

56. 【2602.05400】OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

链接https://arxiv.org/abs/2602.05400

作者:Shaobo Wang,Xuan Ouyang,Tianyi Xu,Yuzheng Hu,Jialin Liu,Guo Chen,Tianyu Zhang,Junhao Zheng,Kexin Yang,Xingzhang Ren,Dayiheng Liu,Linfeng Zhang

类目:Computation and Language (cs.CL)

关键词:text approaches exhaustion, high-quality public text, public text approaches, Data Wall, approaches exhaustion

备注: 45 pages, 7 figures, 8 tables

点击查看摘要

Abstract:As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7\% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.

57. 【2602.05393】Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

链接https://arxiv.org/abs/2602.05393

作者:Ji Zhao,Yufei Gu,Shitong Shao,Xun Zhou,Liang Xiang,Zeke Xie

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:hindering rapid development, remarkable empirical success, Large Language Models, Large Language, achieve remarkable empirical

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: \textit{Can we leverage existing small pretrained models to accelerate the training of larger models?} In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6$\times$ speedup with nearly 5\% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10$\times$ fewer parameters than the target model.

58. 【2602.05392】Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances

链接https://arxiv.org/abs/2602.05392

作者:Jiyun Chun,Eric Fosler-Lussier,Michael White,Andrew Perrault

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:adult-child dialogue remains, dialogue remains challenging, remains challenging due, Gunning Fog Index, insufficient context-sensitive metrics

备注

点击查看摘要

Abstract:Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.

59. 【2602.05385】IESR:Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models

链接https://arxiv.org/abs/2602.05385

作者:Tao Liu,Jiafan Lu,Bohan Yu,Pengcheng Wu,Liu Haixin,Guoyu Xu,Li Xiangheng,Lixiao Li,Jiaming Hou,Zhao Shijun,Xinglin Lyu,Kunli Zhang,Yuxiang Jia,Hongyin Zan

类目:Computation and Language (cs.CL)

关键词:enabling intuitive interaction, maps natural language, natural language processing, natural language questions, language processing task

备注: 25 pages, 16 figures, 8 tables. Hongyin Zan is corresponding author, Jiafan Lu is first co-author

点击查看摘要

Abstract:Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR(Information Enhanced Structured Reasoning) for lightweight large language models: (i) leverages LLMs for key information understanding and schema linking, and decoupling mathematical computation and SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at this https URL.

60. 【2602.05374】Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks

链接https://arxiv.org/abs/2602.05374

作者:Chaimae Abouzahir,Congbo Ma,Nizar Habash,Farah E. Shamout

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, clinical decision support, Large Language, decision support, clinical decision

备注: Accepted to HeaLing-EACL 2026

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question and answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.

61. 【2602.05370】PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

链接https://arxiv.org/abs/2602.05370

作者:Jun Rao,Zixiong Yu,Xuebo Liu,Guhan Chen,Jing Li,Jiansheng Wei,Xiaojun Meng,Min Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, aligning Large Language, Iterative Direct Preference, Direct Preference Optimization, Iterative Direct

备注

点击查看摘要

Abstract:Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce \textbf{PACE} (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2N3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.

62. 【2602.05366】Multi-Field Tool Retrieval

链接https://arxiv.org/abs/2602.05366

作者:Yichen Tang,Weihang Su,Yiqun Liu,Qingyao Ai

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Large Language Models, enables Large Language, Integrating external tools, Language Models, tools enables Large

备注: 12 pages, 4 figures

点击查看摘要

Abstract:Integrating external tools enables Large Language Models (LLMs) to interact with real-world environments and solve complex tasks. Given the growing scale of available tools, effective tool retrieval is essential to mitigate constraints of LLMs' context windows and ensure computational efficiency. Existing approaches typically treat tool retrieval as a traditional ad-hoc retrieval task, matching user queries against the entire raw tool documentation. In this paper, we identify three fundamental challenges that limit the effectiveness of this paradigm: (i) the incompleteness and structural inconsistency of tool documentation; (ii) the significant semantic and granular mismatch between user queries and technical tool documents; and, most importantly, (iii) the multi-aspect nature of tool utility, that involves distinct dimensions, such as functionality, input constraints, and output formats, varying in format and importance. To address these challenges, we introduce Multi-Field Tool Retrieval, a framework designed to align user intent with tool representations through fine-grained, multi-field modeling. Experimental results show that our framework achieves SOTA performance on five datasets and a mixed benchmark, exhibiting superior generalizability and robustness.

63. 【2602.05353】AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

链接https://arxiv.org/abs/2602.05353

作者:Ruijie Shi,Houbin Zhang,Yuecheng Han,Yuheng Wang,Jingru Fan,Runde Yang,Yufan Dang,Huatao Li,Dewen Liu,Yuan Cheng,Chen Qian

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, shown strong capabilities, systems remain difficult, complex problem solving

备注

点击查看摘要

Abstract:Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit architectures for collaboration, many deployed agentic systems operate as black boxes to users. We address this by introducing Agentic Workflow Reconstruction (AWR), a new task aiming to synthesize an explicit, interpretable stand-in workflow that approximates a black-box system using only input--output access. We propose AgentXRay, a search-based framework that formulates AWR as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space. Unlike model distillation, AgentXRay produces editable white-box workflows that match target outputs under an observable, output-based proxy metric, without accessing model parameters. To navigate the vast search space, AgentXRay employs Monte Carlo Tree Search enhanced by a scoring-based Red-Black Pruning mechanism, which dynamically integrates proxy quality with search depth. Experiments across diverse domains demonstrate that AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets.

64. 【2602.05347】How Do Language Models Acquire Character-Level Information?

链接https://arxiv.org/abs/2602.05347

作者:Soma Sato,Ryohei Sasano

类目:Computation and Language (cs.CL)

关键词:implicitly encode character-level, Language models, provided during training, reported to implicitly, implicitly encode

备注: Accepted to EACL 2026 Main Conference

点击查看摘要

Abstract:Language models (LMs) have been reported to implicitly encode character-level information, despite not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.

65. 【2602.05307】MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning

链接https://arxiv.org/abs/2602.05307

作者:Haojin Wang,Yike Wang,Shangbin Feng,Hannaneh Hajishirzi,Yulia Tsvetkov

类目:Computation and Language (cs.CL)

关键词:producing long chains, achieve strong performance, generate redundant reasoning, achieve strong, chains of thought

备注

点击查看摘要

Abstract:Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM--LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while adopting only having 18.4% tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.

66. 【2602.05305】FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

链接https://arxiv.org/abs/2602.05305

作者:Zhuokun Chen,Jianfei Cai,Bohan Zhuang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Generating long-form content, Generating long-form, modern generative models, long-form content, extended texts

备注

点击查看摘要

Abstract:Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: this https URL.

67. 【2602.05289】owards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science

链接https://arxiv.org/abs/2602.05289

作者:Jingru Fan,Dewen Liu,Yufan Dang,Huatao Li,Yuheng Wang,Wei Liu,Feiyu Duan,Xuanwen Ding,Shu Yao,Lin Wu,Ruijie Shi,Wai-Shing Leung,Yuan Cheng,Zhongyu Wei,Cheng Yang,Chen Qian,Zhiyuan Liu,Maosong Sun

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:Large Language Models, Language Models, demonstrating significant effectiveness, Large Language, Recent advancements

备注

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks a unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric fails to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate to establish the collaboration gain metric ($\Gamma$) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging $\Gamma$, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.

68. 【2602.05281】Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

链接https://arxiv.org/abs/2602.05281

作者:Pengyi Li,Elizaveta Goncharova,Andrey Kuznetsov,Ivan Oseledets

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Reinforcement Learning, Learning with Verifiable, Large Language, Group Relative Policy

备注

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.

69. 【2602.05269】Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction

链接https://arxiv.org/abs/2602.05269

作者:David Alejandro Trejo Pizzo

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:deployment of Large, Memory Wall, Large Language Models, Hybrid Gated Flow, Large Language

备注: 21 pages, 4 figures, 6 tables. Code and models will be released at [this http URL](http://opencores.ai)

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" -- a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone. Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale linearly to production-grade language modeling regimes.

Comments:
21 pages, 4 figures, 6 tables. Code and models will be released at this http URL

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

MSC classes:
68T05, 68T50

ACMclasses:
I.2.7; I.2.6

Cite as:
arXiv:2602.05269 [cs.LG]

(or
arXiv:2602.05269v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2602.05269

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
70. 【2602.05261】Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

链接https://arxiv.org/abs/2602.05261

作者:Fanfan Liu,Youyang Yin,Peng Shi,Siqi Yang,Zhixiong Zeng,Haibo Qiu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Verifiable Rewards, Reinforcement Learning, Learning with Verifiable

备注

点击查看摘要

Abstract:Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.

71. 【2602.05258】CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

链接https://arxiv.org/abs/2602.05258

作者:Haoran Li,Sucheng Ren,Alan Yuille,Feng Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Rotary Positional Embedding, Large Language Models, Rotary Positional, Positional Embedding, Large Language

备注

点击查看摘要

Abstract:Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) Semantic Modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping lowfrequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at this https URL.

72. 【2602.05252】Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

链接https://arxiv.org/abs/2602.05252

作者:Guangwei Zhang,Jianing Zhu,Cheng Qian,Neil Gong,Rada Mihalcea,Zhaozhuo Xu,Jingrui He,Jiaqi Ma,Yun Huang,Chaowei Xiao,Bo Li,Ahmed Abbasi,Dongwon Lee,Heng Ji,Denghui Zhang

类目:Computation and Language (cs.CL)

关键词:present Copyright Detective, visualizing potential copyright, Copyright Detective, potential copyright risks, visualizing potential

备注

点击查看摘要

Abstract:We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.

73. 【2602.05235】FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters

链接https://arxiv.org/abs/2602.05235

作者:Zhilin Liang,Yuxiang Wang,Zimu Zhou,Hainan Zhang,Boyi Liu,Yongxin Tong

类目:Computation and Language (cs.CL)

关键词:Large Language Models, enhances Large Language, Language Models, Large Language, enhances Large

备注: 11 pages

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, nonconflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.

74. 【2602.05234】Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

链接https://arxiv.org/abs/2602.05234

作者:Yuntai Bao,Xuhong Zhang,Jintao Chen,Ge Su,Yuxiang Cai,Hao Peng,Bing Sun,Haiqin Weng,Liu Yan,Jianwei Yin

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:offers a lightweight, lightweight and interpretable, interpretable alternative, alternative to prompting, steering

备注: 55 pages, 25 figures; accepted for ICLR 2026

点击查看摘要

Abstract:Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak-supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering. Our code is available at this https URL.

75. 【2602.05220】Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

链接https://arxiv.org/abs/2602.05220

作者:Jinchuan Tian,Haoran Wang,Bo-Hao Su,Chien-yu Huang,Qingzheng Wang,Jiatong Shi,William Chen,Xun Gong,Siddhant Arora,Chin-Jou Li,Masao Someki,Takashi Maekaku,Yusuke Shinohara,Jin Sakuma,Chao-Han Huck Yang,Shinji Watanabe

类目:Computation and Language (cs.CL); Sound (cs.SD)

关键词:addressing isolated factors, Current audio foundation, models typically rely, foundation models typically, Current audio

备注

点击查看摘要

Abstract:Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding generation for general audio. Model, data, and code are available at Bagpiper Home Page.

76. 【2602.05211】Quantifying the Knowledge Proximity Between Academic and Industry Research: An Entity and Semantic Perspective

链接https://arxiv.org/abs/2602.05211

作者:Hongye Zhao,Yi Zhao,Chengzhi Zhang

类目:Computation and Language (cs.CL); Digital Libraries (cs.DL)

关键词:dynamic feedback mechanism, feedback mechanism, reciprocal shaping, shaping and dynamic, dynamic feedback

备注

点击查看摘要

Abstract:The academia and industry are characterized by a reciprocal shaping and dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy the limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. In the entity measurement part, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia's knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at this https URL.

77. 【2602.05205】Aligning Large Language Model Behavior with Human Citation Preferences

链接https://arxiv.org/abs/2602.05205

作者:Kenichiro Ando,Tatsuya Harada

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:powerful large-scale language, large-scale language models, enhance credibility, services built, built on powerful

备注: Work In Progress

点击查看摘要

Abstract:Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as $27\%$ more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by $-22.6\%$ relative to humans) and sentences containing personal names (by $-20.1\%$), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.

78. 【2602.05189】Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky

链接https://arxiv.org/abs/2602.05189

作者:Hsuan-Yu Chou,Wajiha Naveed,Shuyan Zhou,Xiaowei Yang

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

关键词:internet access expands, including harmful content, harmful content, harmful content detection, access expands

备注

点击查看摘要

Abstract:As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%--97%) and specificity (91%--100%) of the open-weight LLMs and those (72%--98%, and 93%--99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.

Subjects:

Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Cite as:
arXiv:2602.05189 [cs.CL]

(or
arXiv:2602.05189v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.05189

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
79. 【2602.05182】he Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

链接https://arxiv.org/abs/2602.05182

作者:Shangbin Feng,Kishan Panaganti,Yulia Tsvetkov,Wenhao Yu

类目:Computation and Language (cs.CL)

关键词:multiple language models, Model, single model, collaboration, loading multiple LMs

备注: Code at [this https URL](https://github.com/BunsenFeng/moco_distill)

点击查看摘要

Abstract:Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models with cost in loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems where the initial model/system struggles to.

80. 【2602.05176】Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

链接https://arxiv.org/abs/2602.05176

作者:Ziyuan Yang,Wenxuan Ding,Shangbin Feng,Yulia Tsvetkov

类目:Computation and Language (cs.CL)

关键词:multiple LMs trained, Language models, multi-agent debate, parties collaborate, collaborate through routing

备注: 19 pages, 15 tables, 4 figures

点击查看摘要

Abstract:Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plug them into four types of popular model collaboration systems, and evaluate the compromised system across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors that oversee model collaboration to disable/mask them out to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.

81. 【2602.05165】EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

链接https://arxiv.org/abs/2602.05165

作者:Kevin Han,Yuhang Zhou,Mingze Gao,Gedi Zhou,Serena Li,Abhishek Kumar,Xiangjun Fan,Weiwei Li,Lizhu Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, Verifiable Rewards, Relative Policy Optimization

备注

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.

82. 【2602.05150】GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek

链接https://arxiv.org/abs/2602.05150

作者:Yang Zhang,Mersin Konomi,Christos Xypolopoulos,Konstantinos Divriotis,Konstantinos Skianis,Giannis Nikolentzos,Giorgos Stamou,Guokan Shang,Michalis Vazirgiannis

类目:Computation and Language (cs.CL)

关键词:Large Language Models, native-sourced content-remain limited, Greek-particularly those based, Large Language, benchmarks for Greek-particularly

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek-particularly those based on authentic, native-sourced content-remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance-including model scale, adaptation, and prompting-and derive insights for improving LLM capabilities in Greek.

83. 【2602.05115】SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers

链接https://arxiv.org/abs/2602.05115

作者:Keyang Xuan,Pengda Wang,Chongrui Ye,Haofei Yu,Tal August,Jiaxuan You

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, emph, language models, increasingly evaluated

备注: 10 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present \textsc{SocialVeil}, a social learning environment that can simulate social interaction under cognitive-difference-induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, \textsc{SocialVeil} introduces three representative types of such disruption, \emph{semantic vagueness}, \emph{sociocultural mismatch}, and \emph{emotional interference}. We also introduce two barrier-aware evaluation metrics, \emph{unresolved confusion} and \emph{mutual understanding}, to evaluate interaction quality under impaired communication. Experiments across 720 scenarios and four frontier LLMs show that barriers consistently impair performance, with mutual understanding reduced by over 45\% on average, and confusion elevated by nearly 50\%. Human evaluations validate the fidelity of these simulated barriers (ICC$\approx$0.78, Pearson r$\approx$0.80). We further demonstrate that adaptation strategies (Repair Instruction and Interactive learning) only have a modest effect far from barrier-free performance. This work takes a step toward bringing social interaction environments closer to real-world communication, opening opportunities for exploring the social intelligence of LLM agents.

84. 【2602.05107】Multilingual Extraction and Recognition of Implicit Discourse Relations in Speech and Text

链接https://arxiv.org/abs/2602.05107

作者:Ahmed Ruby,Christian Hardmeier,Sara Stymne

类目:Computation and Language (cs.CL)

关键词:requires inferring meaning, Implicit discourse relation, discourse relation classification, Implicit discourse, implicit discourse relations

备注

点击查看摘要

Abstract:Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.

85. 【2602.05106】Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models

链接https://arxiv.org/abs/2602.05106

作者:Michael Browder,Kevin Duh,J. David Harris,Vince Lyzinski,Paul McNamee,Youngser Park,Carey E. Priebe,Peter Viechnicki

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:labeled training data, training data remains, building performant language, data scarcity problem, performant language technology

备注

点击查看摘要

Abstract:Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.

86. 【2602.05085】Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

链接https://arxiv.org/abs/2602.05085

作者:Sidi Lu,Zhenwen Liang,Dongyang Ma,Yan Wang,Haitao Mi,Dong Yu

类目:Computation and Language (cs.CL)

关键词:aim to bridge, model parameters, Locally-Supported parametric memory, parametric memory, flexibly offloaded

备注: Tencent AI Lab Technical Report

点击查看摘要

Abstract:In this paper, we aim to bridge test-time-training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other one shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories -- performed in a principled way by reusing model parameters, activations and/or gradients -- is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02\% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.

87. 【2602.05060】StagePilot: A Deep Reinforcement Learning Agent for Stage-Controlled Cybergrooming Simulation

链接https://arxiv.org/abs/2602.05060

作者:Heajun An,Qi Zhang,Minqian Liu,Xinyi Zhang,Sang Won Lee,Lifu Huang,Pamela J. Wisniewski,Jin-Hee Cho

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:proactive educational interventions, necessitating proactive educational, threat to youth, necessitating proactive, educational interventions

备注

点击查看摘要

Abstract:Cybergrooming is an evolving threat to youth, necessitating proactive educational interventions. We propose StagePilot, an offline RL-based dialogue agent that simulates the stage-wise progression of grooming behaviors for prevention training. StagePilot selects conversational stages using a composite reward that balances user sentiment and goal proximity, with transitions constrained to adjacent stages for realism and interpretability. We evaluate StagePilot through LLM-based simulations, measuring stage completion, dialogue efficiency, and emotional engagement. Results show that StagePilot generates realistic and coherent conversations aligned with grooming dynamics. Among tested methods, the IQL+AWAC agent achieves the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.

88. 【2602.05056】VEXA: Evidence-Grounded and Persona-Adaptive Explanations for Scam Risk Sensemaking

链接https://arxiv.org/abs/2602.05056

作者:Heajun An,Connor Ng,Sandesh Sharma Dulal,Junghwan Kim,Jin-Hee Cho

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:short message services, everyday risk assessment, social media increasingly, media increasingly challenge, increasingly challenge everyday

备注

点击查看摘要

Abstract:Online scams across email, short message services, and social media increasingly challenge everyday risk assessment, particularly as generative AI enables more fluent and context-aware deception. Although transformer-based detectors achieve strong predictive performance, their explanations are often opaque to non-experts or misaligned with model decisions. We propose VEXA, an evidence-grounded and persona-adaptive framework for generating learner-facing scam explanations by integrating GradientSHAP-based attribution with theory-informed vulnerability personas. Evaluation across multi-channel datasets shows that grounding explanations in detector-derived evidence improves semantic reliability without increasing linguistic complexity, while persona conditioning introduces interpretable stylistic variation without disrupting evidential alignment. These results reveal a key design insight: evidential grounding governs semantic correctness, whereas persona-based adaptation operates at the level of presentation under constraints of faithfulness. Together, VEXA demonstrates the feasibility of persona-adaptive, evidence-grounded explanations and provides design guidance for trustworthy, learner-facing security explanations in non-formal contexts.

89. 【2602.05035】Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation

链接https://arxiv.org/abs/2602.05035

作者:Sean Trott,Pamela D. Rivière

类目:Computation and Language (cs.CL)

关键词:possibly due, English and Spanish, multilingual LMs, Multilingual, Multilingual language models

备注: 9 pages, 5 figures, conference

点击查看摘要

Abstract:Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this ``multilingual penalty'' for lexical disambiguation--a task requiring precise semantic representations and contextualization mechanisms--using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model's multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.

90. 【2602.05014】DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search

链接https://arxiv.org/abs/2602.05014

作者:Zhanli Li,Huiwen Tian,Lvzhou Luo,Yixuan Cao,Ping Luo

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:decision-driven evidence acquisition, Retrieval-Augmented Generation, agentic large language, large language models, evolving from one-shot

备注: working in progress

点击查看摘要

Abstract:With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead leverages LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes documents at the paragraph level and assigns each paragraph a coordinate-style metadata key encoding its section identity and in-section order. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs while exposing their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments demonstrate that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering. The synergistic effect between retrieval and reading tools is also validated. Our fine-grained behavioral analysis reveals a reading and reasoning paradigm resembling human-like ``locate then read'' behavior.

91. 【2602.05006】Enhanced QKNorm normalization for neural transformers with the Lp norm

链接https://arxiv.org/abs/2602.05006

作者:Ezequiel Lopez-Rubio,Javier Montes-Perez,Esteban Jose Palomo

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Transformer architecture, query and key, essential part, key vectors, Transformer

备注

点击查看摘要

Abstract:The normalization of query and key vectors is an essential part of the Transformer architecture. It ensures that learning is stable regardless of the scale of these vectors. Some normalization approaches are available. In this preliminary work, a generalization of the QKNorm normalization scheme is proposed. The approach is based on the Lp norm, allowing non-Euclidean norms to be employed. Experimental results demonstrate the suitability of the method for a simple problem.

92. 【2602.05004】CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System

链接https://arxiv.org/abs/2602.05004

作者:Zexin Lin,Jiachen Yu,Haoyang Zhang,Yuzhao Li,Zhonghang Li,Yujiu Yang,Junjie Wang,Xiaoqiang Ji

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, sub-second real-time coordination, enabling language-conditioned agents, sustained multi-episode adaptation, Large language

备注

点击查看摘要

Abstract:Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast--slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like realtime collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.

93. 【2602.05000】EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models

链接https://arxiv.org/abs/2602.05000

作者:Atula Tejaswi,Litu Rout,Constantine Caramanis,Sanjay Shakkottai,Sujay Sanghavi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Reward guidance, Reward, downstream reward model, reward model, applied to great

备注: Preprint

点击查看摘要

Abstract:Reward guidance has been applied to great success in the test-time adaptation of continuous diffusion models; it updates each denoising step using the gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both these methods. The former degrades gradient feedback because the reward model has never been trained with continuous inputs. The latter involves incorrect optimization because the gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism called EntRGi: Entropy aware Reward Guidance that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.

94. 【2602.04998】Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

链接https://arxiv.org/abs/2602.04998

作者:Yu-Ang Lee,Ching-Yun Ko,Pin-Yu Chen,Mi-Yen Yeh

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:efficient large language, Low-Rank Adaptation, large language model, prevailing approach, approach for efficient

备注

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies and architectural modifications, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate four representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches. Across mathematical and code generation tasks on diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.

95. 【2602.04982】BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations

链接https://arxiv.org/abs/2602.04982

作者:Deepak Gupta,Davis Bartels,Dina Demner-Fuhsman

类目:Computation and Language (cs.CL)

关键词:generated answers, large language models, large language, answers, generated

备注: Work in progress

点击查看摘要

Abstract:With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, due to the requirements of expert assessment to verify consistency with the scientific literature and complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI) and pre-trained language models and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as a part of BioACE (this https URL) evaluation package.

96. 【2602.04937】Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization

链接https://arxiv.org/abs/2602.04937

作者:Davide Berasi,Matteo Farina,Massimiliano Mancini,Elisa Ricci

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Multimodal Large Language, successful Supervised Fine-Tuning, Large Language, Supervised Fine-Tuning

备注: Preprint

点击查看摘要

Abstract:Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at this https URL.

97. 【2602.04926】Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation

链接https://arxiv.org/abs/2602.04926

作者:Ning Wang,Kuanyan Zhu,Daniel Yuehwoon Yee,Yitang Gao,Shiying Huang,Zirun Xu,Sainyam Galhotra

类目:Databases (cs.DB); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:repeatedly re-retrieving long, re-retrieving long passages, knowledge-intensive LLM tasks, graph-style RAG system, query as fresh

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is now standard for knowledge-intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and cost. We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two-layer consolidation policy (fast ANN/KNN alias detection plus selective $k$-means once a memory threshold is reached) and prune low-value structure, while prompts retain only overlap representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever-REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever-llm, which swaps in an LLM extractor. On GraphRAG-Benchmark (Medical and Novel), both variants achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9--11 points, and remain competitive on contextual summarize and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph-heavy baselines, making it a practical substrate for long-running sessions, evolving corpora, and multi-agent pipelines.

98. 【2602.04925】Internalizing LLM Reasoning via Discovery and Replay of Latent Actions

链接https://arxiv.org/abs/2602.04925

作者:Zhenning Shi,Yijia Zhu,Junhan Shi,Xun Zhang,Lei Wang,Congcong Miao

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:scaling test-time compute, highly efficient paradigm, processes into hidden, test-time compute, hidden states

备注

点击查看摘要

Abstract:The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at this https URL.

99. 【2602.04918】Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution

链接https://arxiv.org/abs/2602.04918

作者:Long Zhang,Fangwei Lin

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Large Language Models, pre-existing parametric memory, Large Language, frequently prioritize conflicting, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory, a phenomenon often termed sycophancy or compliance. However, the mechanistic realization of this behavior remains obscure, specifically how the model resolves these knowledge conflicts through compliance, and whether this suppression arises from signal magnitude dilution or directional geometric alteration within the residual stream. To resolve this, we conducted a layer-wise geometric analysis across Qwen-4B, Llama-3.1-8B, and GLM-4-9B, decomposing the residual stream updates induced by counter-factual contexts into radial (norm-based) and angular (cosine-based) components. Our empirical results reject the universality of the "Manifold Dilution" hypothesis, as two of the three architectures maintained stable residual norms despite exhibiting significant performance degradation on factual queries. Instead, we observed that compliance is consistently characterized by "Orthogonal Interference," where the conflicting context injects a steering vector that is quasi-orthogonal to the ground-truth direction, effectively rotating the hidden state representation. This suggests that models do not "unlearn" or suppress the magnitude of internal truths but rather employ a mechanism of geometric displacement to bypass the correct unembedding vector, effectively simulating adoption while preserving the original structural magnitude. These findings challenge scalar confidence metrics for detecting hallucinations and underscore the necessity of vectorial monitoring to distinguish between genuine knowledge integration and superficial in-context mimicry.

100. 【2602.04912】Atomic Information Flow: A Network Flow Model for Tool Attributions in RAG Systems

链接https://arxiv.org/abs/2602.04912

作者:James Gao,Josh Zhou,Qi Sun,Ryan Huang,Steven Yoo

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Retrieval Augmented Generation, Augmented Generation, tool-based Retrieval Augmented, complex multi-agent architectures, systems lack precise

备注

点击查看摘要

Abstract:Many tool-based Retrieval Augmented Generation (RAG) systems lack precise mechanisms for tracing final responses back to specific tool components -- a critical gap as systems scale to complex multi-agent architectures. We present \textbf{Atomic Information Flow (AIF)}, a graph-based network flow model that decomposes tool outputs and LLM calls into atoms: indivisible, self-contained units of information. By modeling LLM orchestration as a directed flow of atoms from tool and LLM nodes to a response super-sink, AIF enables granular attribution metrics for AI explainability. Motivated by the max-flow min-cut theorem in network flow theory, we train a lightweight Gemma3 (4B parameter) language model as a context compressor to approximate the minimum cut of tool atoms using flow signals computed offline by AIF. We note that the base Gemma3-4B model struggles to identify critical information with \textbf{54.7\%} accuracy on HotpotQA, barely outperforming lexical baselines (BM25). However, post-training on AIF signals boosts accuracy to \textbf{82.71\%} (+28.01 points) while achieving \textbf{87.52\%} (+1.85\%) context token compression -- bridging the gap with the Gemma3-27B variant, a model nearly $7\times$ larger.

Subjects:

Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.04912 [cs.IR]

(or
arXiv:2602.04912v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.04912

Focus to learn more

              arXiv-issued DOI via DataCite</p>
101. 【2602.04916】AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

链接https://arxiv.org/abs/2602.04916

作者:Ling Luo,Wenbin Jiang,Xushi Zhang,Hongyuan Chang,Xinkang Wang,Yueting Xiong,Mengsha Tong,Rongshan Yu

类目:Quantitative Methods (q-bio.QM); Computation and Language (cs.CL)

关键词:Large language models, protein representation learning, significantly advanced protein, advanced protein representation, Large language

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.

信息检索

1. 【2602.05975】SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

链接https://arxiv.org/abs/2602.05975

作者:Tiansheng Hu,Yilun Zhao,Canyu Zhang,Arman Cohan,Chen Zhao

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:addressing complex queries, Deep research agents, Deep research, emerged as powerful, addressing complex

备注: Submission to ACL ARR 2026 January

点击查看摘要

Abstract:Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000 paper retrieval this http URL evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.

2. 【2602.05945】AgenticTagger: Structured Item Representation for Recommendation with LLM Agents

链接https://arxiv.org/abs/2602.05945

作者:Zhouhang Xie,Bo Peng,Zhankui He,Ziqi Chen,Alice Han,Isabella Ye,Benjamin Coleman,Noveen Sachdeva,Fernando Pereira,Julian McAuley,Wang-Cheng Kang,Derek Zhiyuan Cheng,Beidou Wang,Randolph Brown

类目:Information Retrieval (cs.IR)

关键词:requirement for effective, vocabulary, core requirement, Abstract, generation

备注

点击查看摘要

Abstract:High-quality representations are a core requirement for effective recommendation. In this work, we study the problem of LLM-based descriptor generation, i.e., keyphrase-like natural language item representation generation frameworks with minimal constraints on downstream applications. We propose AgenticTagger, a framework that queries LLMs for representing items with sequences of text descriptors. However, open-ended generation provides little control over the generation space, leading to high cardinality, low-performance descriptors that renders downstream modeling challenging. To this end, AgenticTagger features two core stages: (1) a vocabulary building stage where a set of hierarchical, low-cardinality, and high-quality descriptors is identified, and (2) a vocabulary assignment stage where LLMs assign in-vocabulary descriptors to items. To effectively and efficiently ground vocabulary in the item corpus of interest, we design a multi-agent reflection mechanism where an architect LLM iteratively refines the vocabulary guided by parallelized feedback from annotator LLMs that validates the vocabulary against item data. Experiments on public and private data show AgenticTagger brings consistent improvements across diverse recommendation scenarios, including generative and term-based retrieval, ranking, and controllability-oriented, critique-based recommendation.

3. 【2602.05787】Bagging-Based Model Merging for Robust General Text Embeddings

链接https://arxiv.org/abs/2602.05787

作者:Hengran Zhang,Keping Bi,Jiafeng Guo,Jiaming Zhang,Wenbo Yang,Daiting Shi,Xueqi Cheng

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:information retrieval applications, General-purpose text embedding, large-scale multi-task corpora, range of NLP, NLP and information

备注: 12 pages, 4 figures

点击查看摘要

Abstract:General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (\modelname), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, \modelname naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that \modelname consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.

4. 【2602.05735】CSRv2: Unlocking Ultra-Sparse Embeddings

链接https://arxiv.org/abs/2602.05735

作者:Lixuan Guo,Yifei Wang,Tiansheng Wen,Yifan Wang,Aosong Feng,Bo Chen,Stefanie Jegelka,Chenyu You

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Information Theory (cs.IT)

关键词:large foundation models, downstream task performance, foundation models, dense embeddings, era of large

备注: Accepted by ICLR2026

点击查看摘要

Abstract:In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these, Contrastive Sparse Representation (CSR) is recently proposed as a promising direction, mapping dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through progressive k-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning. CSRv2 reduces dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, bringing ultra-sparse embeddings on par with CSR at k=8 and MRL at 32 dimensions, all with only two active features. While maintaining comparable performance, CSRv2 delivers a 7x speedup over MRL, and yields up to 300x improvements in compute and memory efficiency relative to dense embeddings in text representation. Extensive experiments across text and vision demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance, where CSRv2 achieves 7%/4% improvement over CSR when k=4 and further increases this gap to 14%/6% when k=2 in text/vision representation. By making extreme sparsity viable, CSRv2 broadens the design space for real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.

5. 【2602.05734】Evaluating the impact of word embeddings on similarity scoring in practical information retrieval

链接https://arxiv.org/abs/2602.05734

作者:Niall McCarroll,Kevin Curran,Eugene McNamee,Angela Clist,Andrew Brammer

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:search information based, Search behaviour, search information, based on meaning, behaviour is characterised

备注

点击查看摘要

Abstract:Search behaviour is characterised using synonymy and polysemy as users often want to search information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex usage of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high dimensional spaces. This can be leveraged by Information Retrieval (IR) systems to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query statement similarity that moves away from the common similarity measure of centroids of neural word embeddings. Motivated by the Word Movers Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. The top performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, result in domain agnostic language processing solutions that are portable to diverse business use-cases.

Subjects:

Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.05734 [cs.IR]

(or
arXiv:2602.05734v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.05734

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
6. 【2602.05663】GLASS: A Generative Recommender for Long-sequence Modeling via SID-Tier and Semantic Search

链接https://arxiv.org/abs/2602.05663

作者:Shiteng Cao,Junda She,Ji Liu,Bin Zeng,Chengcheng Guo,Kuo Cai,Qiang Luo,Ruiming Tang,Han Li,Kun Gai,Zhiheng Li,Cheng Yang

类目:Information Retrieval (cs.IR)

关键词:Leveraging long-term user, modern recommender systems, user behavioral patterns, Leveraging long-term, recommender systems

备注: 10 pages,3 figures

点击查看摘要

Abstract:Leveraging long-term user behavioral patterns is a key trajectory for enhancing the accuracy of modern recommender systems. While generative recommender systems have emerged as a transformative paradigm, they face hurdles in effectively modeling extensive historical sequences. To address this challenge, we propose GLASS, a novel framework that integrates long-term user interests into the generative process via SID-Tier and Semantic Search. We first introduce SID-Tier, a module that maps long-term interactions into a unified interest vector to enhance the prediction of the initial SID token. Unlike traditional retrieval models that struggle with massive item spaces, SID-Tier leverages the compact nature of the semantic codebook to incorporate cross features between the user's long-term history and candidate semantic codes. Furthermore, we present semantic hard search, which utilizes generated coarse-grained semantic ID as dynamic keys to extract relevant historical behaviors, which are then fused via an adaptive gated fusion module to recalibrate the trajectory of subsequent fine-grained tokens. To address the inherent data sparsity in semantic hard search, we propose two strategies: semantic neighbor augmentation and codebook resizing. Extensive experiments on two large-scale real-world datasets, TAOBAO-MM and KuaiRec, demonstrate that GLASS outperforms state-of-the-art baselines, achieving significant gains in recommendation quality. Our codes are made publicly available to facilitate further research in generative recommendation.

7. 【2602.05512】A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering

链接https://arxiv.org/abs/2602.05512

作者:Larissa Pusch,Alexandre Courtiol,Tim Conrad

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, knowledge-intensive domains due, outdated information, limited explainability, Large Language

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require a knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.

8. 【2602.05474】LMMRec: LLM-driven Motivation-aware Multimodal Recommendation

链接https://arxiv.org/abs/2602.05474

作者:Yicheng Di,Zhanjie Zhang,Yun Wangc,Jinren Liue,Jiaqi Yanf,Jiyu Wei,Xiangyu Chend,Yuan Liu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Motivation-based recommendation systems, recommendation systems uncover, Motivation-based recommendation, user behavior drivers, systems uncover user

备注

点击查看摘要

Abstract:Motivation-based recommendation systems uncover user behavior drivers. Motivation modeling, crucial for decision-making and content preference, explains recommendation generation. Existing methods often treat motivation as latent variables from interaction data, neglecting heterogeneous information like review text. In multimodal motivation fusion, two challenges arise: 1) achieving stable cross-modal alignment amid noise, and 2) identifying features reflecting the same underlying motivation across modalities. To address these, we propose LLM-driven Motivation-aware Multimodal Recommendation (LMMRec), a model-agnostic framework leveraging large language models for deep semantic priors and motivation understanding. LMMRec uses chain-of-thought prompting to extract fine-grained user and item motivations from text. A dual-encoder architecture models textual and interaction-based motivations for cross-modal alignment, while Motivation Coordination Strategy and Interaction-Text Correspondence Method mitigate noise and semantic drift through contrastive learning and momentum updates. Experiments on three datasets show LMMRec achieves up to a 4.98\% performance improvement.

9. 【2602.05445】Forward Index Compression for Learned Sparse Retrieval

链接https://arxiv.org/abs/2602.05445

作者:Sebastian Bruch,Martino Fontana,Franco Maria Nardini,Cosimo Rulli,Rossano Venturini

类目:Information Retrieval (cs.IR)

关键词:highly effective approach, learned sparse representations, Text retrieval, queries and documents, effective approach

备注

点击查看摘要

Abstract:Text retrieval using learned sparse representations of queries and documents has, over the years, evolved into a highly effective approach to search. It is thanks to recent advances in approximate nearest neighbor search-with the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based Hnsw-that retrieval with sparse representations became viable in practice. In this work, we scrutinize the efficiency of sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques to reduce the storage footprint of the forward index without compromising search quality or inner product computation latency. In our examination with various integer compression techniques, we report that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve StreamVByte by introducing DotVByte, a new algorithm tailored to inner product computation. Experiments on MsMarco show that our improvements lead to significant space savings while maintaining retrieval efficiency.

10. 【2602.05413】SciDef: Automating Definition Extraction from Academic Literature with Large Language Models

链接https://arxiv.org/abs/2602.05413

作者:Filip Kučera,Christoph Mandl,Isao Echizen,Radu Timofte,Timo Spinde

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:publication numbers, significant increase, increase in publication, gathering definitions relevant, Definitions

备注: Under Review - Submitted to SIGIR 2026 Resources Track; 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Definitions are the foundation for any scientific work, but with a significant increase in publication numbers, gathering definitions relevant to any keyword has become challenging. We therefore introduce SciDef, an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra DefSim, novel datasets of human-extracted definitions and definition-pairs' similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi-step and DSPy-optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI-based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test-set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over-generate them. Code datasets are available at this https URL.

Comments:
Under Review - Submitted to SIGIR 2026 Resources Track; 8 pages, 6 figures, 4 tables

Subjects:

Information Retrieval (cs.IR); Computation and Language (cs.CL)

Cite as:
arXiv:2602.05413 [cs.IR]

(or
arXiv:2602.05413v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.05413

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
11. 【2602.05408】Rich-Media Re-Ranker: A User Satisfaction-Driven LLM Re-ranking Framework for Rich-Media Search

链接https://arxiv.org/abs/2602.05408

作者:Zihao Guo,Ligang Zhou,Zeyang Tang,Feicheng Li,Ying Nie,Zhiming Peng,Qingyun Sun,Jianxin Li

类目:Information Retrieval (cs.IR)

关键词:user search satisfaction, modern information search, satisfy user information, initial search results, plays a crucial

备注

点击查看摘要

Abstract:Re-ranking plays a crucial role in modern information search systems by refining the ranking of initial search results to better satisfy user information needs. However, existing methods show two notable limitations in improving user search satisfaction: inadequate modeling of multifaceted user intents and neglect of rich side information such as visual perception signals. To address these challenges, we propose the Rich-Media Re-Ranker framework, which aims to enhance user search satisfaction through multi-dimensional and fine-grained modeling. Our approach begins with a Query Planner that analyzes the sequence of query refinements within a session to capture genuine search intents, decomposing the query into clear and complementary sub-queries to enable broader coverage of users' potential intents. Subsequently, moving beyond primary text content, we integrate richer side information of candidate results, including signals modeling visual content generated by the VLM-based evaluator. These comprehensive signals are then processed alongside carefully designed re-ranking principle that considers multiple facets, including content relevance and quality, information gain, information novelty, and the visual presentation of cover images. Then, the LLM-based re-ranker performs the holistic evaluation based on these principles and integrated signals. To enhance the scenario adaptability of the VLM-based evaluator and the LLM-based re-ranker, we further enhance their capabilities through multi-task reinforcement learning. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines. Notably, the proposed framework has been deployed in a large-scale industrial search system, yielding substantial improvements in online user engagement rates and satisfaction metrics.

12. 【2602.05366】Multi-Field Tool Retrieval

链接https://arxiv.org/abs/2602.05366

作者:Yichen Tang,Weihang Su,Yiqun Liu,Qingyao Ai

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Large Language Models, enables Large Language, Integrating external tools, Language Models, tools enables Large

备注: 12 pages, 4 figures

点击查看摘要

Abstract:Integrating external tools enables Large Language Models (LLMs) to interact with real-world environments and solve complex tasks. Given the growing scale of available tools, effective tool retrieval is essential to mitigate constraints of LLMs' context windows and ensure computational efficiency. Existing approaches typically treat tool retrieval as a traditional ad-hoc retrieval task, matching user queries against the entire raw tool documentation. In this paper, we identify three fundamental challenges that limit the effectiveness of this paradigm: (i) the incompleteness and structural inconsistency of tool documentation; (ii) the significant semantic and granular mismatch between user queries and technical tool documents; and, most importantly, (iii) the multi-aspect nature of tool utility, that involves distinct dimensions, such as functionality, input constraints, and output formats, varying in format and importance. To address these challenges, we introduce Multi-Field Tool Retrieval, a framework designed to align user intent with tool representations through fine-grained, multi-field modeling. Experimental results show that our framework achieves SOTA performance on five datasets and a mixed benchmark, exhibiting superior generalizability and robustness.

13. 【2602.05334】NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain

链接https://arxiv.org/abs/2602.05334

作者:Dawn Lawrie,James Mayfield,Eugene Yang,Andrew Yates,Sean MacAvaney,Ronak Pradeep,Scott Miller,Paul McNamee,Luca Soldaini

类目:Information Retrieval (cs.IR)

关键词:Measuring advances, requires test collections, retrieval requires test, requires test, Measuring

备注: 14 pages, 6 figures

点击查看摘要

Abstract:Measuring advances in retrieval requires test collections with relevance judgments that can faithfully distinguish systems. This paper presents NeuCLIRTech, an evaluation collection for cross-language retrieval over technical information. The collection consists of technical documents written natively in Chinese and those same documents machine translated into English. It includes 110 queries with relevance judgments. The collection supports two retrieval scenarios: monolingual retrieval in Chinese, and cross-language retrieval with English as the query language. NeuCLIRTech combines the TREC NeuCLIR track topics of 2023 and 2024. The 110 queries with 35,962 document judgments provide strong statistical discriminatory power when trying to distinguish retrieval approaches. A fusion baseline of strong neural retrieval systems is included so that developers of reranking algorithms are not reliant on BM25 as their first stage retriever. The dataset and artifacts are released on Huggingface Datasets

14. 【2602.05216】Semantic Search over 9 Million Mathematical Theorems

链接https://arxiv.org/abs/2602.05216

作者:Luke Alexander,Eric Leonen,Sophie Szeto,Artemii Remizov,Ignacio Tejeda,Giovanni Inchiostro,Vasily Ilin

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); History and Overview (math.HO)

关键词:retrieve entire papers, results remains difficult, mathematical results remains, tools retrieve entire, entire papers

备注: Feedback is welcome

点击查看摘要

Abstract:Searching for mathematical results remains difficult: most existing tools retrieve entire papers, while mathematicians and theorem-proving agents often seek a specific theorem, lemma, or proposition that answers a query. While semantic search has seen rapid progress, its behavior on large, highly technical corpora such as research-level mathematical theorems remains poorly understood. In this work, we introduce and study semantic theorem retrieval at scale over a unified corpus of $9.2$ million theorem statements extracted from arXiv and seven other sources, representing the largest publicly available corpus of human-authored, research-level theorems. We represent each theorem with a short natural-language description as a retrieval representation and systematically analyze how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated evaluation set of theorem-search queries written by professional mathematicians, our approach substantially improves both theorem-level and paper-level retrieval compared to existing baselines, demonstrating that semantic theorem search is feasible and effective at web scale. The theorem search tool is available at \href{this https URL}{this link}, and the dataset is available at \href{this https URL}{this link}.

15. 【2602.05152】RAG without Forgetting: Continual Query-Infused Key Memory

链接https://arxiv.org/abs/2602.05152

作者:Yuntong Hu,Sha Li,Naren Ramakrishnan,Liang Zhao

类目:Information Retrieval (cs.IR)

关键词:systems commonly improve, commonly improve robustness, Retrieval-augmented generation, systems commonly, commonly improve

备注: 24 pages, 12 figures

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems commonly improve robustness via query-time adaptations such as query expansion and iterative retrieval. While effective, these approaches are inherently stateless: adaptations are recomputed for each query and discarded thereafter, precluding cumulative learning and repeatedly incurring inference-time cost. Index-side approaches like key expansion introduce persistence but rely on offline preprocessing or heuristic updates that are weakly aligned with downstream task utility, leading to semantic drift and noise accumulation. We propose Evolving Retrieval Memory (ERM), a training-free framework that transforms transient query-time gains into persistent retrieval improvements. ERM updates the retrieval index through correctness-gated feedback, selectively attributes atomic expansion signals to the document keys they benefit, and progressively evolves keys via stable, norm-bounded updates. We show that query and key expansion are theoretically equivalent under standard similarity functions and prove convergence of ERM's selective updates, amortizing optimal query expansion into a stable index with zero inference-time overhead. Experiments on BEIR and BRIGHT across 13 domains demonstrate consistent gains in retrieval and generation, particularly on reasoning-intensive tasks, at native retrieval speed.

16. 【2602.05143】HugRAG: Hierarchical Causal Knowledge Graph Design for RAG

链接https://arxiv.org/abs/2602.05143

作者:Nengbo Wang,Tuo Liang,Vikash Singh,Chaoda Song,Van Yang,Yu Yin,Jing Ma,Jagdip Singh,Vipin Chaudhary

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Retrieval augmented generation, enhanced large language, Retrieval augmented, large language models, augmented generation

备注

点击查看摘要

Abstract:Retrieval augmented generation (RAG) has enhanced large language models by enabling access to external knowledge, with graph-based RAG emerging as a powerful paradigm for structured retrieval and reasoning. However, existing graph-based methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Prior attempts to incorporate causality are typically limited to local or single-document contexts and also suffer from information isolation that arises from modular graph structures, which hinders scalability and cross-module causal reasoning. To address these challenges, we propose HugRAG, a framework that rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. HugRAG explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs. Extensive experiments demonstrate that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics. Our work establishes a principled foundation for structured, scalable, and causally grounded RAG systems.

17. 【2602.05087】Autodiscover: A reinforcement learning recommendation system for the cold-start imbalance challenge in active learning, powered by graph-aware thompson sampling

链接https://arxiv.org/abs/2602.05087

作者:Parsa Vares

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

关键词:scientific output grows, Systematic literature reviews, Systematic literature, evidence-based research, output grows

备注: Master's Thesis, University of Luxembourg in collaboration with Luxembourg Institute of Science and Technology (LIST). Supervised by Prof. Jun Pang and Dr. Eloi Durant

点击查看摘要

Abstract:Systematic literature reviews (SLRs) are fundamental to evidence-based research, but manual screening is an increasing bottleneck as scientific output grows. Screening features low prevalence of relevant studies and scarce, costly expert decisions. Traditional active learning (AL) systems help, yet typically rely on fixed query strategies for selecting the next unlabeled documents. These static strategies do not adapt over time and ignore the relational structure of scientific literature networks. This thesis introduces AutoDiscover, a framework that reframes AL as an online decision-making problem driven by an adaptive agent. Literature is modeled as a heterogeneous graph capturing relationships among documents, authors, and metadata. A Heterogeneous Graph Attention Network (HAN) learns node representations, which a Discounted Thompson Sampling (DTS) agent uses to dynamically manage a portfolio of query strategies. With real-time human-in-the-loop labels, the agent balances exploration and exploitation under non-stationary review dynamics, where strategy utility changes over time. On the 26-dataset SYNERGY benchmark, AutoDiscover achieves higher screening efficiency than static AL baselines. Crucially, the agent mitigates cold start by bootstrapping discovery from minimal initial labels where static approaches fail. We also introduce TS-Insight, an open-source visual analytics dashboard to interpret, verify, and diagnose the agent's decisions. Together, these contributions accelerate SLR screening under scarce expert labels and low prevalence of relevant studies.

18. 【2602.05062】Scaling Laws for Embedding Dimension in Information Retrieval

链接https://arxiv.org/abs/2602.05062

作者:Julian Killingback,Mahta Rafiee,Madine Manas,Hamed Zamani

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:nearest neighbor algorithms, fast approximate nearest, approximate nearest neighbor, embedding dimension, neural retrieval approach

备注: 9 Pages, 7 figures

点击查看摘要

Abstract:Dense retrieval, which encodes queries and documents into a single dense vector, has become the dominant neural retrieval approach due to its simplicity and compatibility with fast approximate nearest neighbor algorithms. As the tasks dense retrieval performs grow in complexity, the fundamental limitations of the underlying data structure and similarity metric -- namely vectors and inner-products -- become more apparent. Prior recent work has shown theoretical limitations inherent to single vectors and inner-products that are generally tied to the embedding dimension. Given the importance of embedding dimension for retrieval capacity, understanding how dense retrieval performance changes as embedding dimension is scaled is fundamental to building next generation retrieval models that balance effectiveness and efficiency. In this work, we conduct a comprehensive analysis of the relationship between embedding dimension and retrieval performance. Our experiments include two model families and a range of model sizes from each to construct a detailed picture of embedding scaling behavior. We find that the scaling behavior fits a power law, allowing us to derive scaling laws for performance given only embedding dimension, as well as a joint law accounting for embedding dimension and model size. Our analysis shows that for evaluation tasks aligned with the training task, performance continues to improve as embedding size increases, though with diminishing returns. For evaluation data that is less aligned with the training task, we find that performance is less predictable, with performance degrading with larger embedding dimensions for certain tasks. We hope our work provides additional insight into the limitations of embeddings and their behavior as well as offers a practical guide for selecting model and embedding dimension to achieve optimal performance with reduced storage and compute costs.

19. 【2602.05014】DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search

链接https://arxiv.org/abs/2602.05014

作者:Zhanli Li,Huiwen Tian,Lvzhou Luo,Yixuan Cao,Ping Luo

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:decision-driven evidence acquisition, Retrieval-Augmented Generation, agentic large language, large language models, evolving from one-shot

备注: working in progress

点击查看摘要

Abstract:With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead leverages LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes documents at the paragraph level and assigns each paragraph a coordinate-style metadata key encoding its section identity and in-section order. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs while exposing their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments demonstrate that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering. The synergistic effect between retrieval and reading tools is also validated. Our fine-grained behavioral analysis reveals a reading and reasoning paradigm resembling human-like ``locate then read'' behavior.

20. 【2602.04936】Deterministic Retrieval at Scale: Optimal-Space LCP Indexing and 308x Energy Reduction on Modern GPUs

链接https://arxiv.org/abs/2602.04936

作者:Stanislav Byriukov

类目:Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

关键词:Longest Common Prefix, Longest Common, study deterministic top-k, Common Prefix, sequences of length

备注

点击查看摘要

Abstract:We study deterministic top-k retrieval under Longest Common Prefix (LCP) similarity for N sequences of length L. We prove a tight Omega(N) space lower bound (cell-probe model) and present a trie-based index using O(N*L) space with O(L+k) query time. We contrast this with pairwise materialization (Theta(N^2)), which hits a practical OOM wall at scale, while our indexed approach remains O(N) in memory. We then introduce Thermal-Aware Logic (TAL), which turns prefix structure into range-bounded scans. In hardware measurements, TAL reduces energy per query by 308x (0.0145 J vs 4.46 J) and cuts p95 latency by 329x (0.114 ms vs 37.5 ms) on a 20M-item range-scan benchmark, while sustaining near-peak utilization (~99%) under long runs. The result is a deterministic retrieval primitive with receipts in regimes where approximate methods are unacceptable.

21. 【2602.04912】Atomic Information Flow: A Network Flow Model for Tool Attributions in RAG Systems

链接https://arxiv.org/abs/2602.04912

作者:James Gao,Josh Zhou,Qi Sun,Ryan Huang,Steven Yoo

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Retrieval Augmented Generation, Augmented Generation, tool-based Retrieval Augmented, complex multi-agent architectures, systems lack precise

备注

点击查看摘要

Abstract:Many tool-based Retrieval Augmented Generation (RAG) systems lack precise mechanisms for tracing final responses back to specific tool components -- a critical gap as systems scale to complex multi-agent architectures. We present \textbf{Atomic Information Flow (AIF)}, a graph-based network flow model that decomposes tool outputs and LLM calls into atoms: indivisible, self-contained units of information. By modeling LLM orchestration as a directed flow of atoms from tool and LLM nodes to a response super-sink, AIF enables granular attribution metrics for AI explainability. Motivated by the max-flow min-cut theorem in network flow theory, we train a lightweight Gemma3 (4B parameter) language model as a context compressor to approximate the minimum cut of tool atoms using flow signals computed offline by AIF. We note that the base Gemma3-4B model struggles to identify critical information with \textbf{54.7\%} accuracy on HotpotQA, barely outperforming lexical baselines (BM25). However, post-training on AIF signals boosts accuracy to \textbf{82.71\%} (+28.01 points) while achieving \textbf{87.52\%} (+1.85\%) context token compression -- bridging the gap with the Gemma3-27B variant, a model nearly $7\times$ larger.

Subjects:

Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.04912 [cs.IR]

(or
arXiv:2602.04912v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.04912

Focus to learn more

              arXiv-issued DOI via DataCite</p>

计算机视觉

1. 【2602.06043】Shared LoRA Subspaces for almost Strict Continual Learning

链接https://arxiv.org/abs/2602.06043

作者:Prakhar Kaushik,Ankit Vaidya,Shravan Chaudhari,Rama Chellappa,Alan Yuille

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Adapting large pretrained, remains challenging due, Adapting large, large pretrained models, cost of retraining

备注

点击查看摘要

Abstract:Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration, without relying on data replay, or multiple adapters. We propose Share, a novel approach to parameter efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer, while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.

2. 【2602.06042】Pseudo-Invertible Neural Networks

链接https://arxiv.org/abs/2602.06042

作者:Yamit Ehrlich,Nimrod Berman,Assaf Shocher

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Moore-Penrose Pseudo-inverse, Pseudo-invertible Neural Networks, Surjective Pseudo-invertible Neural, neural networks, Pseudo-inverse

备注

点击查看摘要

Abstract:The Moore-Penrose Pseudo-inverse (PInv) serves as the fundamental solution for linear systems. In this paper, we propose a natural generalization of PInv to the nonlinear regime in general and to neural networks in particular. We introduce Surjective Pseudo-invertible Neural Networks (SPNN), a class of architectures explicitly designed to admit a tractable non-linear PInv. The proposed non-linear PInv and its implementation in SPNN satisfy fundamental geometric properties. One such property is null-space projection or "Back-Projection", $x' = x + A^\dagger(y-Ax)$, which moves a sample $x$ to its closest consistent state $x'$ satisfying $Ax=y$. We formalize Non-Linear Back-Projection (NLBP), a method that guarantees the same consistency constraint for non-linear mappings $f(x)=y$ via our defined PInv. We leverage SPNNs to expand the scope of zero-shot inverse problems. Diffusion-based null-space projection has revolutionized zero-shot solving for linear inverse problems by exploiting closed-form back-projection. We extend this method to non-linear degradations. Here, "degradation" is broadly generalized to include any non-linear loss of information, spanning from optical distortions to semantic abstractions like classification. This approach enables zero-shot inversion of complex degradations and allows precise semantic control over generative outputs without retraining the diffusion prior.

3. 【2602.06041】Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

链接https://arxiv.org/abs/2602.06041

作者:Xuejun Zhang,Aditi Tiwari,Zhenhailong Wang,Heng Ji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current multimodal large, spatial reasoning remains, reasoning remains challenging, multimodal large language, Multi-image spatial reasoning

备注

点击查看摘要

Abstract:Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.

4. 【2602.06040】SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

链接https://arxiv.org/abs/2602.06040

作者:Jintao Tong,Shilin Yan,Hongwei Xue,Xiaojun Tang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, made remarkable progress

备注: Project Page: [this https URL](https://accio-lab.github.io/SwimBird)

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.

5. 【2602.06038】CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

链接https://arxiv.org/abs/2602.06038

作者:Xiaopan Zhang,Zejin Wang,Zhixu Li,Jianpeng Yao,Jiachen Li

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:manipulate target objects, answer relevant questions, Embodied Question Answering, complete assignments provided, natural language

备注: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: [this https URL](https://comm-cp.github.io/)

点击查看摘要

Abstract:To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: this https URL.

6. 【2602.06037】hinking with Geometry: Active Geometry Integration for Spatial Reasoning

链接https://arxiv.org/abs/2602.06037

作者:Haoyuan Li,Qihang Cao,Tao Tang,Kun Xiang,Zihan Guo,Jianhua Han,Hang Xu,Xiaodan Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Recent progress

备注

点击查看摘要

Abstract:Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at this https URL.

7. 【2602.06035】InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

链接https://arxiv.org/abs/2602.06035

作者:Sirui Xu,Samuel Schulter,Morteza Ziyadi,Xialin He,Xiaohan Fei,Yu-Xiong Wang,Liangyan Gui

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

关键词:Humans rarely plan, explicit whole-body movements, Humans rarely, rarely plan whole-body, rarely plan

备注: Webpage: [this https URL](https://sirui-xu.github.io/InterPrior/)

点击查看摘要

Abstract:Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.

8. 【2602.06034】V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

链接https://arxiv.org/abs/2602.06034

作者:Dongyang Chen,Chaoyang Wang,Dezhao SU,Xi Xiao,Zeyu Zhang,Jing Xiong,Qing Li,Yuzhang Shang,Shichao Ka

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, improves candidate reranking

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual this http URL train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.

9. 【2602.06032】Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

链接https://arxiv.org/abs/2602.06032

作者:David Shavin,Sagie Benaim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Foundation Models, Vision Foundation, achieved remarkable success, Foundation Models, achieved remarkable

备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feedforward manner. These 3D features are then ``splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, ``distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at this https URL

10. 【2602.06028】Context Forcing: Consistent Autoregressive Video Generation with Long Context

链接https://arxiv.org/abs/2602.06028

作者:Shuo Chen,Cong Wei,Sun Sun,Ping Nie,Kai Zhou,Ge Zhang,Ming-Hsuan Yang,Wenhu Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:streaming tuning strategies, typically employ streaming, employ streaming tuning, Recent approaches, generation typically employ

备注

点击查看摘要

Abstract:Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical \textbf{student-teacher mismatch}: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose \textbf{Context Forcing}, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a \textbf{Slow-Fast Memory} architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.

11. 【2602.06017】MambaVF: State Space Model for Efficient Video Fusion

链接https://arxiv.org/abs/2602.06017

作者:Zixiang Zhao,Yukun Cui,Lilun Deng,Haowen Bai,Haotong Qin,Tao Feng,Konrad Schindler

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video fusion, fundamental technique, video processing tasks, Video, fusion

备注

点击查看摘要

Abstract:Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs and a 2.1x speedup compared to existing methods. Project page: this https URL

12. 【2602.06013】GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

链接https://arxiv.org/abs/2602.06013

作者:Ruihang Li,Leigang Qu,Jingxu Zhang,Dongnan Gui,Mengde Xu,Xiaosong Zhang,Han Hu,Wenjie Wang,Jiaqi Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:visual generation, visual generation models, traditional evaluation approaches, outpaced traditional evaluation, necessitating the adoption

备注: Project Page: [this https URL](https://genarena.github.io/) , Code: [this https URL](https://github.com/ruihanglix/genarena)

点击查看摘要

Abstract:The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

13. 【2602.05998】VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

链接https://arxiv.org/abs/2602.05998

作者:Jie Deng,Kaichun Yao,Libo Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:translate user interface, user interface screenshots, executable frontend code, aims to translate, translate user

备注

点击查看摘要

Abstract:Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.

14. 【2602.05986】RISE-Video: Can Video Generators Decode Implicit World Rules?

链接https://arxiv.org/abs/2602.05986

作者:Mingxin Liu,Shuran Ma,Shibei Meng,Xiangyu Zhao,Zicheng Zhang,Shaofeng Zhang,Zhihang Zhong,Peixian Chen,Haoyu Cao,Xing Sun,Haodong Duan,Xue Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:world rules remains, remarkable visual fidelity, achieved remarkable visual, implicit world rules, under-explored frontier

备注: 38 pages, 16 figures, 3 tables; Code: [this https URL](https://github.com/VisionXLab/RISE-Video;) HuggingFace: [this https URL](https://huggingface.co/datasets/VisionXLab/RISE-Video)

点击查看摘要

Abstract:While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: \textit{Reasoning Alignment}, \textit{Temporal Consistency}, \textit{Physical Rationality}, and \textit{Visual Quality}. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

15. 【2602.05966】LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation

链接https://arxiv.org/abs/2602.05966

作者:Mirlan Karimov,Teodora Spasojevic,Markus Braun,Julian Wiederer,Vasileios Belagiannis,Marc Pollefeys

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:enabling realistic synthesis, Controllable video generation, Controllable video, video generation, autonomous driving

备注: Accepted to IEEE IV 2026. 8 pages, 3 figures. Code available at [this https URL](https://github.com/mirlanium/LSA)

点击查看摘要

Abstract:Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips localized around dynamic objects inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency in generated videos we adapt two additional metrics from object detection task, namely mAP and mIoU. Extensive experiments on nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference and any computational overheads.

16. 【2602.05951】Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

链接https://arxiv.org/abs/2602.05951

作者:Junwan Kim,Jiho Park,Seonghu Jeon,Seungryong Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:diffusion-based generative models, recently emerged, promising alternative, alternative to diffusion-based, diffusion-based generative

备注: Project Page: [this https URL](https://junwankimm.github.io/CSFM)

点击查看摘要

Abstract:Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under flow matching objective that better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

17. 【2602.05937】Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation

链接https://arxiv.org/abs/2602.05937

作者:Lingrui Li,Yanfeng Zhou,Nan Pu,Xin Chen,Zhun Zhong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Distribution shift, pre-trained semantic segmentation, clinical centers, significantly hindering, common challenge

备注: 8 pages, BIBM2025

点击查看摘要

Abstract:Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation(CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain:1)lacking multi-scale prompt diversity, 2)inadequate incorporation of instance-specific knowledge, and 3)risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning(MGIPT), to enhance scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt(AIP) and a Multi-scale Global-level Prompt(MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.

18. 【2602.05909】CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression

链接https://arxiv.org/abs/2602.05909

作者:Kangjie Zhang,Wenxuan Huang,Xin Zhou,Boxiang Zhou,Dejia Song,Yuan Xie,Baochang Zhang,Lizhuang Ma,Nemo Chen,Xu Tang,Yao Hu,Shaohui Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Contrastive Language-Image Pre-training, computer vision tasks, achieved widely applications, Image captioning, Contrastive Language-Image

备注

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has achieved widely applications in various computer vision tasks, e.g., text-to-image generation, Image-Text retrieval and Image captioning. However, CLIP suffers from high memory and computation cost, which prohibits its usage to the resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting their subset as weight inheritance for further retraining via mask optimization or important weight measurement. However, these select-based weight inheritance often compromises the feature presentation ability, especially on the extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution shifting problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.

19. 【2602.05884】Neural Implicit 3D Cardiac Shape Reconstruction from Sparse CT Angiography Slices Mimicking 2D Transthoracic Echocardiography Views

链接https://arxiv.org/abs/2602.05884

作者:Gino E. Jansen,Carolina Brás,R. Nils Planken,Mark J. Schuuring,Berto J. Bouma,Ivana Išgum

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词:quantitative analysis, analysis of anatomy, CTA, sparse CTA planes, CTA planes

备注

点击查看摘要

Abstract:Accurate 3D representations of cardiac structures allow quantitative analysis of anatomy and function. In this work, we propose a method for reconstructing complete 3D cardiac shapes from segmentations of sparse planes in CT angiography (CTA) for application in 2D transthoracic echocardiography (TTE). Our method uses a neural implicit function to reconstruct the 3D shape of the cardiac chambers and left-ventricle myocardium from sparse CTA planes. To investigate the feasibility of achieving 3D reconstruction from 2D TTE, we select planes that mimic the standard apical 2D TTE views. During training, a multi-layer perceptron learns shape priors from 3D segmentations of the target structures in CTA. At test time, the network reconstructs 3D cardiac shapes from segmentations of TTE-mimicking CTA planes by jointly optimizing the latent code and the rigid transforms that map the observed planes into 3D space. For each heart, we simulate four realistic apical views, and we compare reconstructed multi-class volumes with the reference CTA volumes. On a held-out set of CTA segmentations, our approach achieves an average Dice coefficient of 0.86 $\pm$ 0.04 across all structures. Our method also achieves markedly lower volume errors than the clinical standard, Simpson's biplane rule: 4.88 $\pm$ 4.26 mL vs. 8.14 $\pm$ 6.04 mL, respectively, for the left ventricle; and 6.40 $\pm$ 7.37 mL vs. 37.76 $\pm$ 22.96 mL, respectively, for the left atrium. This suggests that our approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography.

20. 【2602.05882】EoCD: Encoder only Remote Sensing Change Detection

链接https://arxiv.org/abs/2602.05882

作者:Mubashir Noman,Mustansar Fiaz,Hiyam Debary,Abdul Hannan,Shah Nawaz,Fahad Shahbaz Khan,Salman Khan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:modern earth observation, change detection, change detection performance, change detection methods, change

备注

点击查看摘要

Abstract:Being a cornerstone of temporal analysis, change detection has been playing a pivotal role in modern earth observation. Existing change detection methods rely on the Siamese encoder to individually extract temporal features followed by temporal fusion. Subsequently, these methods design sophisticated decoders to improve the change detection performance without taking into consideration the complexity of the model. These aforementioned issues intensify the overall computational cost as well as the network's complexity which is undesirable. Alternatively, few methods utilize the early fusion scheme to combine the temporal images. These methods prevent the extra overhead of Siamese encoder, however, they also rely on sophisticated decoders for better performance. In addition, these methods demonstrate inferior performance as compared to late fusion based methods. To bridge these gaps, we introduce encoder only change detection (EoCD) that is a simple and effective method for the change detection task. The proposed method performs the early fusion of the temporal data and replaces the decoder with a parameter-free multiscale feature fusion module thereby significantly reducing the overall complexity of the model. EoCD demonstrate the optimal balance between the change detection performance and the prediction speed across a variety of encoder architectures. Additionally, EoCD demonstrate that the performance of the model is predominantly dependent on the encoder network, making the decoder an additional component. Extensive experimentation on four challenging change detection datasets reveals the effectiveness of the proposed method.

21. 【2602.05880】Contour Refinement using Discrete Diffusion in Low Data Regime

链接https://arxiv.org/abs/2602.05880

作者:Fei Yu Guan,Ian Keefe,Sophie Wilkinson,Daniel D.B. Perrakis,Steven Waslander

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:situ computational resources, scarce labeled data, low data regime, Boundary detection, environmental monitoring

备注: CRV 2026, 8 pages, 6 figures

点击查看摘要

Abstract:Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, where many of these applications are plagued with scarce labeled data and low in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground-truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. We use a Convolutional Neural Network(CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including using a simplified diffusion process, a customized model architecture, and minimal post processing to produce a dense, isolated contour given a dataset of size 500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5X.

22. 【2602.05871】Pathwise Test-Time Correction for Autoregressive Long Video Generation

链接https://arxiv.org/abs/2602.05871

作者:Xunzhi Xiang,Zixuan Duan,Guiyu Zhang,Haiyu Zhang,Zhe Gao,Junta Wu,Shaofeng Zhang,Tengfei Wang,Qi Fan,Chunchao Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe error accumulation, facilitate real-time short, real-time short video, short video synthesis, Distilled autoregressive diffusion

备注

点击查看摘要

Abstract:Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.

23. 【2602.05845】Self-Supervised Learning with a Multi-Task Latent Space Objective

链接https://arxiv.org/abs/2602.05845

作者:Pierre-François De Plaen,Abhishek Jha,Luc Van Gool,Tinne Tuytelaars,Marc Proesmans

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:networks learn visual, learn visual representations, Self-supervised learning, Siamese networks learn, methods based

备注

点击查看摘要

Abstract:Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which incorporates small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.

24. 【2602.05832】UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents

链接https://arxiv.org/abs/2602.05832

作者:Han Xiao,Guozhi Wang,Hao Wang,Shilong Liu,Yuxiang Chai,Yue Pan,Yufeng Zhou,Xiaoxin Chen,Yafei Wen,Hongsheng Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Online Reinforcement Learning, Reinforcement Learning, direct environment interaction, offers a promising, environment interaction

备注: 23 pages, 16 figures. Project page: [this https URL](https://ui-mem.github.io)

点击查看摘要

Abstract:Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: this https URL

25. 【2602.05829】Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning

链接https://arxiv.org/abs/2602.05829

作者:Yudi Shi,Shangzhe Di,Qirui Chen,Qinian Wang,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demands robust perceptual, interpretive skills, constitutes a comprehensive, comprehensive assessment, demands robust

备注

点击查看摘要

Abstract:Video reasoning constitutes a comprehensive assessment of a model's capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.

26. 【2602.05827】Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

链接https://arxiv.org/abs/2602.05827

作者:Hai Zhang,Siqi Liang,Li Chen,Yuxian Li,Yukuan Xu,Yichao Zhong,Fu Zhang,Hongyang Li

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:bound to detailed, detailed and verbose, verbose language instructions, vision-language navigation, navigation

备注

点击查看摘要

Abstract:Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal for navigation in the real-world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horimzon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.

27. 【2602.05822】NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects

链接https://arxiv.org/abs/2602.05822

作者:Musawar Ali,Manuel Carranza-García,Nicola Fioraio,Samuele Salti,Luigi Di Stefano

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complementary RGB sequences, RGB inputs, RGB sequences, propose NVS-HO, complementary RGB

备注

点击查看摘要

Abstract:We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn a NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.

28. 【2602.05809】Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning

链接https://arxiv.org/abs/2602.05809

作者:Enwei Tong,Yuanchao Bai,Yao Zhu,Junjun Jiang,Xianming Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:greatly increase inference, increase inference latency, generate massive visual, balance local evidence, memory footprint

备注

点击查看摘要

Abstract:Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source codes can be found at this https URL

29. 【2602.05789】Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation

链接https://arxiv.org/abs/2602.05789

作者:Hengyi Wang,Ruiqiang Zhang,Chang Liu,Guanjie Wang,Zehua Ma,Han Fang,Weiming Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:receiving growing focus, Vision-Language Navigation, allocentric perception capabilities, spatially grounded tasks, growing focus

备注

点击查看摘要

Abstract:With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceriver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perciver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.

30. 【2602.05785】ReText: Text Boosts Generalization in Image-Based Person Re-identification

链接https://arxiv.org/abs/2602.05785

作者:Timur Mamedov,Karina Kvanchiani,Anton Konushin,Vadim Konushin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Generalizable image-based person, Generalizable image-based, aims to recognize, recognize individuals, individuals across cameras

备注

点击查看摘要

Abstract:Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.

31. 【2602.05755】FMPose3D: monocular 3D pose estimation via flow matching

链接https://arxiv.org/abs/2602.05755

作者:Ti Wang,Xiaohang Yu,Mackenzie Weygandt Mathis

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fundamentally ill-posed due, Ordinary Differential Equation, motivating probabilistic methods, ambiguity and occlusions, fundamentally ill-posed

备注

点击查看摘要

Abstract:Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at this https URL.

32. 【2602.05737】Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing

链接https://arxiv.org/abs/2602.05737

作者:Luca Ciampi,Ludovico Iannello,Fabrizio Tonelli,Gabriele Lagani,Angelo Di Garbo,Federico Cremisi,Giuseppe Amato

类目:Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:cultured cortical neurons, cortical neurons serves, neurons serves, living neural, reservoir computing

备注

点击查看摘要

Abstract:In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses-arising from noise, spontaneous activity, and inter-session differences-the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.

33. 【2602.05730】Depth as Prior Knowledge for Object Detection

链接https://arxiv.org/abs/2602.05730

作者:Moussa Kassem Sbeyti,Nadja Klein

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:distant objects remains, objects remains challenging, low resolution, scale variation, background clutter

备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at this https URL.

34. 【2602.05729】Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification

链接https://arxiv.org/abs/2602.05729

作者:Lexiang Hu,Youze Xue,Dian Li,Gang Liu,Zhouchen Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multimodal embeddings serve, MLLM-based embedding models, global semantic information, vision and language, primary implementations

备注

点击查看摘要

Abstract:Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.

35. 【2602.05718】Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization

链接https://arxiv.org/abs/2602.05718

作者:Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:lightly frame-annotated paradigm, locate action instances, effectively locate action, Point-supervised Temporal Action, Action

备注

点击查看摘要

Abstract:Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (\textit{i.e.}, labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only a point-supervised snippet-level classification, without explicit modeling of understanding temporal relationships among frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full frames of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point supervision action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.

36. 【2602.05710】Ethology of Latent Spaces

链接https://arxiv.org/abs/2602.05710

作者:Philippe Boisnard

类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:study challenges, challenges the presumed, presumed neutrality, adopting an ethological, ethological perspective

备注: 23. pages, 14 figures, presented Hyperheritage International Symposium 9 ( [this https URL](https://paragraphe.univ-paris8.fr/IMG/pdf/programme_colloque_his9_campuscondorcet_v3.pdf) ) and accepted for publication in double-blind peer review in French in 2026-2027

点击查看摘要

Abstract:This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices. Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while remaining apolitical in OpenAI CLIP. On an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points. We introduce three operational concepts: computational latent politicization, describing the emergence of political categories without intentional encoding; emergent bias, irreducible to statistical or normative bias and detectable only through contrastive analysis; and three algorithmic scopic regimes: entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), which structure distinct modes of visibility. Drawing on Foucault's notion of the archive, Jameson's ideologeme, and Simondon's theory of individuation, we argue that training datasets function as quasi-archives whose discursive formations crystallize within latent space. This work contributes to a critical reassessment of the conditions under which VLMs are applied to digital art history and calls for methodologies that integrate learning architectures into any delegation of cultural interpretation to algorithmic agents.

Comments:
23. pages, 14 figures, presented Hyperheritage International Symposium 9 ( this https URL ) and accepted for publication in double-blind peer review in French in 2026-2027

Subjects:

Computers and Society (cs.CY); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2602.05710 [cs.CY]

(or
arXiv:2602.05710v1 [cs.CY] for this version)

https://doi.org/10.48550/arXiv.2602.05710

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
37. 【2602.05706】Poster: Camera Tampering Detection for Outdoor IoT Systems

链接https://arxiv.org/abs/2602.05706

作者:Shadi Attarha,Kanaga Shanmugi,Anna Förster

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:surveillance and security, outdoor settings, settings has grown, grown to improve, improve surveillance

备注: Proceedings of the 2024 INTERNATIONAL CONFERENCE ON EMBEDDED WIRELESS SYSTEMS AND NETWORKS (EWSN)

点击查看摘要

Abstract:Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.

38. 【2602.05676】ShapeUP: Scalable Image-Conditioned 3D Editing

链接https://arxiv.org/abs/2602.05676

作者:Inbar Gat,Dana Cohen-Bar,Guy Levy,Elad Richardson,Daniel Cohen-Or

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Recent advancements, significant challenge, enabled the generation, generation of high-fidelity, remains a significant

备注

点击查看摘要

Abstract:Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.

39. 【2602.05650】Enhancing Personality Recognition by Comparing the Predictive Power of Traits, Facets, and Nuances

链接https://arxiv.org/abs/2602.05650

作者:Amir Ansari,Jana Subirana,Bruna Silva,Sergio Escalera,David Gallardo-Pujol,Cristina Palmero

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:construct typically assessed, item-level questionnaires aggregated, hierarchical construct typically, broad trait scores, trait scores

备注: Accepted to the 2025 13th International Conference on Affective Computing and Intelligent Interaction (Late Breaking Results)

点击查看摘要

Abstract:Personality is a complex, hierarchical construct typically assessed through item-level questionnaires aggregated into broad trait scores. Personality recognition models aim to infer personality traits from different sources of behavioral data. However, reliance on broad trait scores as ground truth, combined with limited training data, poses challenges for generalization, as similar trait scores can manifest through diverse, context dependent behaviors. In this work, we explore the predictive impact of the more granular hierarchical levels of the Big-Five Personality Model, facets and nuances, to enhance personality recognition from audiovisual interaction data. Using the UDIVA v0.5 dataset, we trained a transformer-based model including cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms. Results show that nuance-level models consistently outperform facet and trait-level models, reducing mean squared error by up to 74% across interaction scenarios.

40. 【2602.05638】UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

链接https://arxiv.org/abs/2602.05638

作者:Jinlin Wu,Felix Holm,Chuxi Chen,An Wang,Yaxin Hu,Xiaofan Ye,Zelin Zang,Miao Xu,Lihua Zhou,Huai Liao,Danny T. M. Chan,Ming Feng,Wai S. Poon,Hongliang Ren,Dong Yi,Nassir Navab,Gaofeng Meng,Jiebo Luo,Hongbin Liu,Zhen Lei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current approaches rely, low-level visual details, approaches rely predominantly, semantic structures essential, waste model capacity

备注

点击查看摘要

Abstract:While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.

41. 【2602.05629】ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing

链接https://arxiv.org/abs/2602.05629

作者:Jianlei Chi,Yuzhen Wu,Jiaxuan Hou,Xiaodong Zhang,Ming Fan,Suhui Sun,Weijun Dai,Bo Li,Jianguo Sun,Jun Sun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated Driving System, Automated Driving, Driving System, brain of autonomous, Automated

备注: The manuscript includes 13 pages, 8 tables, and 7 figures

点击查看摘要

Abstract:Automated Driving System (ADS) acts as the brain of autonomous vehicles, responsible for their safety and efficiency. Safe deployment requires thorough testing in diverse real-world scenarios and compliance with traffic laws like speed limits, signal obedience, and right-of-way rules. Violations like running red lights or speeding pose severe safety risks. However, current testing approaches face significant challenges: limited ability to generate complex and high-risk law-breaking scenarios, and failing to account for complex interactions involving multiple vehicles and critical situations. To address these challenges, we propose ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism. ROMAN is designed to generate high-risk violation scenarios to enable more thorough and targeted ADS evaluation. The multi-head attention mechanism models interactions among vehicles, traffic signals, and other factors. The traffic law weighting mechanism implements a workflow that leverages an LLM-based risk weighting module to evaluate violations based on the two dimensions of severity and occurrence. We have evaluated ROMAN by testing the Baidu Apollo ADS within the CARLA simulation platform and conducting extensive experiments to measure its performance. Experimental results demonstrate that ROMAN surpassed state-of-the-art tools ABLE and LawBreaker by achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while also maintaining greater scenario diversity. In addition, only ROMAN successfully generated violation scenarios for every clause of the input traffic laws, enabling it to identify more high-risk violations than existing approaches.

42. 【2602.05617】Unified Sensor Simulation for Autonomous Driving

链接https://arxiv.org/abs/2602.05617

作者:Nikolay Patakin,Arsenii Shirokov,Anton Konushin,Dmitry Senushkin

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:autonomous driving, autonomous driving applications, autonomous driving datasets, sensor simulation framework, XSIM extends

备注

点击查看摘要

Abstract:In this work, we introduce \textbf{XSIM}, a sensor simulation framework for autonomous driving. XSIM extends 3DGUT splatting with a generalized rolling-shutter modeling tailored for autonomous driving applications. Our framework provides a unified and flexible formulation for appearance and geometric sensor modeling, enabling rendering of complex sensor distortions in dynamic environments. We identify spherical cameras, such as LiDARs, as a critical edge case for existing 3DGUT splatting due to cyclic projection and time discontinuities at azimuth boundaries leading to incorrect particle projection. To address this issue, we propose a phase modeling mechanism that explicitly accounts temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. In addition, we introduce an extended 3D Gaussian representation that incorporates two distinct opacity parameters to resolve mismatches between geometry and color distributions. As a result, our framework provides enhanced scene representations with improved geometric consistency and photorealistic appearance. We evaluate our framework extensively on multiple autonomous driving datasets, including Waymo Open Dataset, Argoverse 2, and PandaSet. Our framework consistently outperforms strong recent baselines and achieves state-of-the-art performance across all datasets. The source code is publicly available at \href{this https URL}{this https URL}.

43. 【2602.05605】Shiva-DiT: Residual-Based Differentiable Top-$k$ Selection for Efficient Diffusion Transformers

链接https://arxiv.org/abs/2602.05605

作者:Jiaji Zhang,Hailiang Zhao,Guoxuan Zhu,Ruichao Sun,Jiaju Wu,Xinkui Zhao,Hanlin Tang,Weiyi Lu,Kan Liu,Tao Lan,Lin Qu,Shuiguang Deng

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion Transformers, incur prohibitive computational, prohibitive computational costs, computational costs due, incur prohibitive

备注

点击查看摘要

Abstract:Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention. Existing pruning methods fail to simultaneously satisfy differentiability, efficiency, and the strict static budgets required for hardware overhead. To address this, we propose Shiva-DiT, which effectively reconciles these conflicting requirements via Residual-Based Differentiable Top-$k$ Selection. By leveraging a residual-aware straight-through estimator, our method enforces deterministic token counts for static compilation while preserving end-to-end learnability through residual gradient estimation. Furthermore, we introduce a Context-Aware Router and Adaptive Ratio Policy to autonomously learn an adaptive pruning schedule. Experiments on mainstream models, including SD3.5, demonstrate that Shiva-DiT establishes a new Pareto frontier, achieving a 1.54$\times$ wall-clock speedup with superior fidelity compared to existing baselines, effectively eliminating ragged tensor overheads.

44. 【2602.05602】Multi-instance robust fitting for non-classical geometric models

链接https://arxiv.org/abs/2602.05602

作者:Zongliang Zhang,Shuxiang Li,Xingwang Huang,Zongyue Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:designed for classical, existing robust fitting, robust fitting methods, non-classical models, models

备注

点击查看摘要

Abstract:Most existing robust fitting methods are designed for classical models, such as lines, circles, and planes. In contrast, fewer methods have been developed to robustly handle non-classical models, such as spiral curves, procedural character models, and free-form surfaces. Furthermore, existing methods primarily focus on reconstructing a single instance of a non-classical model. This paper aims to reconstruct multiple instances of non-classical models from noisy data. We formulate this multi-instance fitting task as an optimization problem, which comprises an estimator and an optimizer. Specifically, we propose a novel estimator based on the model-to-data error, capable of handling outliers without a predefined error threshold. Since the proposed estimator is non-differentiable with respect to the model parameters, we employ a meta-heuristic algorithm as the optimizer to seek the global optimum. The effectiveness of our method are demonstrated through experimental results on various non-classical models. The code is available at this https URL.

45. 【2602.05598】CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion

链接https://arxiv.org/abs/2602.05598

作者:Aon Safdar,Mohamed Saadeldin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:computer vision tasks, demonstrated strong performance, computer vision, vision tasks, modeling long-range spatial

备注: Presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 25) in the 4th Workshop on Transformers for Visions - T4V ( [this https URL](https://sites.google.com/view/t4v-cvpr25/) ) Accepted for Publication at 33rd International Conference on Artificial Intelligence and Cognitive Science (AICS 2025), where it was shortlisted for Best Paper Award. ( [this https URL](https://aicsconf.org/?page_id=278) )

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.

46. 【2602.05590】EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality

链接https://arxiv.org/abs/2602.05590

作者:Haojie Cheng,Shaun Jing Heng Ong,Shaoyu Cai,Aiden Tat Yang Koh,Fuxi Ouyang,Eng Tat Khoo

类目:Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Graphics (cs.GR)

关键词:Immersive virtual reality, Immersive virtual, applications demand accurate, virtual reality, applications demand

备注

点击查看摘要

Abstract:Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.

47. 【2602.05588】A Mixed Reality System for Robust Manikin Localization in Childbirth Training

链接https://arxiv.org/abs/2602.05588

作者:Haojie Cheng,Chang Liu,Abhiram Kanneganti,Mahesh Arjandas Choolani,Arundhati Tushar Gosavi,Eng Tat Khoo

类目:Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Graphics (cs.GR)

关键词:shortened clinical rotations, gain practical experience, patient reluctance, clinical rotations, nature of labour

备注

点击查看摘要

Abstract:Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians' instructional burden and enhance trainees' learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training. Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.

48. 【2602.05582】Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation

链接https://arxiv.org/abs/2602.05582

作者:Joe-Mei Feng,Sheng-Wei Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unified operator-theoretic framework, camera pose estimation, analyzing per-feature sensitivity, Fisher information geometry, present a unified

备注

点击查看摘要

Abstract:We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound. The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.05582 [cs.CV]

(or
arXiv:2602.05582v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.05582

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
49. 【2602.05578】LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation

链接https://arxiv.org/abs/2602.05578

作者:Junyang Chen,Xiangbo Lv,Zhiqiang Kou,Xingdong Sheng,Ning Xu,Yiguo Qiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:extends traditional closed-set, arbitrary textual descriptions, enabling pixel-wise annotation, traditional closed-set segmentation, extends traditional

备注

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. However, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.

50. 【2602.05577】LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization

链接https://arxiv.org/abs/2602.05577

作者:Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:posing unprecedented challenges, enabled highly controllable, Recent advancements, visual content, posing unprecedented

备注: 11 pages, 7 figures

点击查看摘要

Abstract:Recent advancements in image editing have enabled highly controllable and semantically-aware alteration of visual content, posing unprecedented challenges to manipulation localization. However, existing AI-generated forgery localization methods primarily focus on inpainting-based manipulations, making them ineffective against the latest instruction-based editing paradigms. To bridge this critical gap, we propose LocateEdit-Bench, a large-scale dataset comprising $231$K edited images, designed specifically to benchmark localization methods against instruction-driven image editing. Our dataset incorporates four cutting-edge editing models and covers three common edit types. We conduct a detailed analysis of the dataset and develop two multi-metric evaluation protocols to assess existing localization methods. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization. Dataset will be open-sourced upon acceptance.

51. 【2602.05574】A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features

链接https://arxiv.org/abs/2602.05574

作者:Mengyu Li,Ingibjörg Kristjánsdóttir,Thilo van Eimeren,Kathrin Giehl,Lotta M. Ellingsen, theASAP Neuroimaging Initiative

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Atypical Parkinsonian Disorders, Atypical Parkinsonian, Parkinsonian Disorders, progressive supranuclear palsy, multiple system atrophy

备注: To be published in Proceedings of SPIE Medical Imaging 2026

点击查看摘要

Abstract:Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson's disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.

52. 【2602.05573】Visual Implicit Geometry Transformer for Autonomous Driving

链接https://arxiv.org/abs/2602.05573

作者:Arsenii Shirokov,Mikhail Kuznetsov,Danila Stepochkin,Egor Evdokimov,Daniil Glazkov,Nikolay Patakin,Anton Konushin,Dmitry Senushkin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Implicit Geometry Transformer, Visual Implicit Geometry, Visual Implicit, introduce the Visual, surround-view camera rigs

备注

点击查看摘要

Abstract:We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a birds-eye-view (BEV) addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where ViGT achieves comparable performance with supervised methods. The source code is publicly available at \href{this https URL}{this https URL}.

53. 【2602.05572】ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors

链接https://arxiv.org/abs/2602.05572

作者:Zhenxiao Liang,Ning Zhang,Youbao Tang,Ruei-Sung Lin,Qixing Huang,Peng Chang,Jing Xiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human, vision priors, casual monocular videos, priors, monocular videos

备注

点击查看摘要

Abstract:We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.

54. 【2602.05557】PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds

链接https://arxiv.org/abs/2602.05557

作者:Michael Schwingshackl,Fabio F. Oberweger,Mario Niedermeyer,Huemer Johannes,Markus Murschitz

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:object detection framework, point cloud data, occlusion-affected point cloud, detection framework, present PIRATR

备注: 8 Pages, 11 Figures, Accepted at 2026 IEEE International Conference on Robotics Automation (ICRA) Vienna

点击查看摘要

Abstract:We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at this https URL.

55. 【2602.05555】IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools

链接https://arxiv.org/abs/2602.05555

作者:Panagiotis Sapoutzoglou,Orestis Vaggelis,Athina Zacharia,Evangelos Sartinas,Maria Pateraki

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:RGB-D benchmark dataset, tools and components, RGB-D benchmark, industrial tools, dataset

备注: To appear in ICRA 2026

点击查看摘要

Abstract:We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products or use synthetic, clean tabletop datasets, or objects captured solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object and it is organized in two parts: the classic set and the extended set. The classic set includes a total of 4,6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the art methods for instance-based and novel object 6D pose estimation, including also object detection, segmentation, showing that there is room for improvement in this domain. The dataset page can be found in this https URL.

56. 【2602.05552】VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator

链接https://arxiv.org/abs/2602.05552

作者:Bessie Dominguez-Dager,Sergio Suescun-Ferrandiz,Felix Escalona,Francisco Gomez-Donoso,Miguel Cazorla

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:paper introduces VLN-Pilot, assumes the role, paper introduces, Model, introduces VLN-Pilot

备注

点击查看摘要

Abstract:This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.

57. 【2602.05551】FastVMT: Eliminating Redundancy in Video Motion Transfer

链接https://arxiv.org/abs/2602.05551

作者:Yue Ma,Zhikai Wang,Tianhao Ren,Mingzhe Zheng,Hongyu Liu,Jiayi Guo,Mark Fong,Yuxuan Xue,Zixiang Zhao,Konrad Schindler,Qifeng Chen,Linfeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:motion transfer aims, motion pattern observed, generating visual content, transfer aims, aims to synthesize

备注: Accepted by ICLR2026, Project page: [this http URL](http://fastvmt.gitHub.io) , Code: [this https URL](https://github.com/mayuelala/FastVMT)

点击查看摘要

Abstract:Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.

58. 【2602.05538】A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments

链接https://arxiv.org/abs/2602.05538

作者:Malaz Tamim,Andrea Matic-Flierl,Karsten Roscher

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:person detection, industrial monitoring, critical for safety, safety in applications, Accurate

备注: Accepted for VISAPP 2026

点击查看摘要

Abstract:Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.

59. 【2602.05536】When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

链接https://arxiv.org/abs/2602.05536

作者:Yayuan Li,Ze Peng,Jian Zhang,Jintao Guo,Yue Duan,Yinghuan Shi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:combines multiple fine-tuned, multiple fine-tuned models, merging combines multiple, providing a lightweight, alternative to retraining

备注

点击查看摘要

Abstract:Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: this https URL.

60. 【2602.05534】SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

链接https://arxiv.org/abs/2602.05534

作者:Youngwoo Shin,Jiwan Hur,Junmo Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-fidelity synthesis mirroring, mirroring human perception, synthesis mirroring human, naturally achieving, next-scale prediction

备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at this https URL.

61. 【2602.05527】Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains

链接https://arxiv.org/abs/2602.05527

作者:Ben Isselmann,Dilara Göksu,Andreas Weinmann

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:learn robust feature, train deep learning, robust feature representations, small to train, train deep

备注: AMEE Conference Proceeding 2025, 11 pages, 2 figures

点击查看摘要

Abstract:Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = 0.8221 \pm 0.0062), slightly outperforming a DINO model trained directly on OpenCell (0.8057 \pm 0.0090). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.

62. 【2602.05522】Mapper-GIN: Lightweight Structural Graph Abstraction for Corrupted 3D Point Cloud Classification

链接https://arxiv.org/abs/2602.05522

作者:Jeongbin You,Donggun Kim,Sejun Park,Seungsang Oh

类目:Computer Vision and Pattern Recognition (cs.CV); Geometric Topology (math.GT)

关键词:specialized data augmentation, data augmentation, pursued by scaling, scaling up backbones, backbones or relying

备注

点击查看摘要

Abstract:Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (PCA lens, cubical cover, and followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.

63. 【2602.05508】VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency

链接https://arxiv.org/abs/2602.05508

作者:Zhuang Xiong,Chen Zhang,Qingshan Xu,Wenbing Tao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:vision foundation models, scale drift remains, drift remains severe, vision foundation, foundation models

备注

点击查看摘要

Abstract:Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.

64. 【2602.05496】XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

链接https://arxiv.org/abs/2602.05496

作者:Hanwen Zhang,Yao Liu,Peiyuan Jiang,Lang Junjie,Xie Jun,Yihui He,Yajiao Deng,Siyu Du,Qiao Liu

类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Emotion Recognition plays, Multimodal Emotion Recognition, social media analytics, Emotional Cue Bridge, Recognition plays

备注

点击查看摘要

Abstract:Explainable Multimodal Emotion Recognition plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.

65. 【2602.05487】Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence -- A report on experiments done in 2014

链接https://arxiv.org/abs/2602.05487

作者:Julien Moreau(Heudiasyc),S. Ambellouis,Yassine Ruichek(CIAD)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Photorealistic Fisheye Sequence, https URL, PFSeq for Photorealistic, Sequence and make, Photorealistic Fisheye

备注

点击查看摘要

Abstract:What is this report: This is a scientific report, contributing with a detailed bibliography, a dataset which we will call now PFSeq for ''Photorealistic Fisheye Sequence'' and make available at this https URL. 57745/DYIVVU, and comprehensive experiments. This work should be considered as a draft, and has been done during my PhD thesis ''Construction of 3D models from fisheye video data-Application to the localisation in urban area'' in 2014 [Mor16]. These results have never been published. The aim was to find the best features detector and descriptor for fisheye images, in the context of selfcalibration, with cameras mounted on the top of a car and aiming at the zenith (to proceed then fisheye visual odometry and stereovision in urban scenes). We face a chicken and egg problem, because we can not take advantage of an accurate projection model for an optimal features detection and description, and we rightly need good features to perform the calibration (i.e. to compute the accurate projection model of the camera). What is not this report: It does not contribute with new features algorithm. It does not compare standard features algorithms to algorithms designed for omnidirectional images (unfortunately). It has not been peer-reviewed. Discussions have been translated and enhanced but the experiments have not been run again and the report has not been updated accordingly to the evolution of the state-of-the-art (read this as a 2014 report).

66. 【2602.05480】SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing

链接https://arxiv.org/abs/2602.05480

作者:Peihao Wu,Yongxiang Yao,Yi Wan,Wenfei Zhang,Ruipeng Zhao,Jiayuan Li,Yongjun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, provide complementary strengths, transcending single-modality constraints

备注

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: this https URL.

67. 【2602.05467】MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

链接https://arxiv.org/abs/2602.05467

作者:Dekang Qi,Shuang Zeng,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Mu Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)

关键词:Visual Language Navigation, Visual Language, Language Navigation, fundamental capabilities, capabilities for embodied

备注: 9 pages, 2 figures, 5 tables, conference

点击查看摘要

Abstract:Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.

68. 【2602.05464】Refine and Purify: Orthogonal Basis Optimization with Null-Space Denoising for Conditional Representation Learning

链接https://arxiv.org/abs/2602.05464

作者:Jiaquan Wang,Yan Lyu,Chen Li,Yuheng Jia

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:representation learning aims, Conditional representation learning, extract criterion-specific features, obtain conditional representations, learning aims

备注

点击查看摘要

Abstract:Conditional representation learning aims to extract criterion-specific features for customized tasks. Recent studies project universal features onto the conditional feature subspace spanned by an LLM-generated text basis to obtain conditional representations. However, such methods face two key limitations: sensitivity to subspace basis and vulnerability to inter-subspace interference. To address these challenges, we propose OD-CRL, a novel framework integrating Adaptive Orthogonal Basis Optimization (AOBO) and Null-Space Denoising Projection (NSDP). Specifically, AOBO constructs orthogonal semantic bases via singular value decomposition with a curvature-based truncation. NSDP suppresses non-target semantic interference by projecting embeddings onto the null space of irrelevant subspaces. Extensive experiments conducted across customized clustering, customized classification, and customized retrieval tasks demonstrate that OD-CRL achieves a new state-of-the-art performance with superior generalization.

69. 【2602.05454】Attention Retention for Continual Learning with Vision Transformers

链接https://arxiv.org/abs/2602.05454

作者:Yue Lu,Xiangyu Zhou,Shizhou Zhang,Yinghui Xing,Guoqiang Liang,Wencong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:non-stationary data streams, progressively acquire knowledge, Continual learning, data streams, progressively acquire

备注: AAAI-2026 Camera Ready

点击查看摘要

Abstract:Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.

70. 【2602.05449】DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

链接https://arxiv.org/abs/2602.05449

作者:Chang Zou,Changlin Li,Yang Li,Patrol Li,Jianbing Wu,Xiao He,Songtao Liu,Zhao Zhong,Kailin Huang,Linfeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:escalating computational burden, achieved great success, rapidly escalating computational, computational burden, achieved great

备注: 17 pages, 7 figures; cvpr2026 submission

点击查看摘要

Abstract:While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper novelly introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.

71. 【2602.05440】Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations

链接https://arxiv.org/abs/2602.05440

作者:Natascha Jeziorski,Petra Gospodnetić,Claudia Redenbach

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:data, detection is crucial, synthetic, synthetic data, quality control

备注

点击查看摘要

Abstract:In industry, defect detection is crucial for quality control. Non-destructive testing (NDT) methods are preferred as they do not influence the functionality of the object while inspecting. Automated data evaluation for automated defect detection is a growing field of research. In particular, machine learning approaches show promising results. To provide training data in sufficient amount and quality, synthetic data can be used. Rule-based approaches enable synthetic data generation in a controllable environment. Therefore, a digital twin of the inspected object including synthetic defects is needed. We present parametric methods to model 3d mesh objects of various defect types that can then be added to the object geometry to obtain synthetic defective objects. The models are motivated by common defects in metal casting but can be transferred to other machining procedures that produce similar defect shapes. Synthetic data resembling the real inspection data can then be created by using a physically based Monte Carlo simulation of the respective testing method. Using our defect models, a variable and arbitrarily large synthetic data set can be generated with the possibility to include rarely occurring defects in sufficient quantity. Pixel-perfect annotation can be created in parallel. As an example, we will use visual surface inspection, but the procedure can be applied in combination with simulations for any other NDT method.

72. 【2602.05435】Stable Velocity: A Variance Perspective on Flow Matching

链接https://arxiv.org/abs/2602.05435

作者:Donglin Yang,Yongxing Zhang,Xin Yu,Liang Hou,Xin Tao,Pengfei Wan,Xiaojuan Qi,Renjie Liao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:single-sample conditional velocities, conditional velocities leads, Stable Velocity, Stable Velocity Matching, slow convergence

备注

点击查看摘要

Abstract:While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthen auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at this https URL.

73. 【2602.05434】LD-SLRO: Latent Diffusion Structured Light for 3-D Reconstruction of Highly Reflective Objects

链接https://arxiv.org/abs/2602.05434

作者:Sanghoon Jeon,Gihyun Jung,Suhyeon Ka,Jae-Sang Hyun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:low surface roughness, surface roughness remains, Fringe projection profilometry-based, projection profilometry-based, significant challenge

备注: 10 pages, 7 figures

点击查看摘要

Abstract:Fringe projection profilometry-based 3-D reconstruction of objects with high reflectivity and low surface roughness remains a significant challenge. When measuring such glossy surfaces, specular reflection and indirect illumination often lead to severe distortion or loss of the projected fringe patterns. To address these issues, we propose a latent diffusion-based structured light for reflective objects (LD-SLRO). Phase-shifted fringe images captured from highly reflective surfaces are first encoded to extract latent representations that capture surface reflectance characteristics. These latent features are then used as conditional inputs to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recover lost fringe information, yielding high-quality fringe images. The proposed components, including the specular reflection encoder, time-variant channel affine layer, and attention modules, further improve fringe restoration quality. In addition, LD-SLRO provides high flexibility in configuring the input and output fringe sets. Experimental results demonstrate that the proposed method improves both fringe quality and 3-D reconstruction accuracy over state-of-the-art methods, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.

74. 【2602.05429】M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining

链接https://arxiv.org/abs/2602.05429

作者:Rui Lv,Juncheng Mo,Tianyi Chu,Chen Rao,Hongyi Jing,Jiajie Teng,Jiafu Chen,Shiqi Zhang,Liangzi Ding,Shuo Fang,Huaizhong Lin,Ziqiang Dang,Chenguang Ma,Lei Zhao

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Graphical User Interface, Graphical User, User Interface, advancing intelligent human-computer, human-computer interaction paradigms

备注: Accepted by ICLR 2026. Supplementary material is included at the end of the main paper (16 pages, 15 figures, 2 tables)

点击查看摘要

Abstract:Graphical User Interface (GUI) agent is pivotal to advancing intelligent human-computer interaction paradigms. Constructing powerful GUI agents necessitates the large-scale annotation of high-quality user-behavior trajectory data (i.e., intent-trajectory pairs) for training. However, manual annotation methods and current GUI agent data mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M$^2$-Miner, the first low-cost and automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). For better data mining efficiency and quality, we present a collaborative multi-agent framework, comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further enhance the efficiency of mining and enrich intent diversity, we design an intent recycling strategy to extract extra valuable interaction trajectories. Additionally, a progressive model-in-the-loop training strategy is introduced to improve the success rate of data mining. Extensive experiments have demonstrated that the GUI agent fine-tuned using our mined data achieves state-of-the-art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate the community research.

75. 【2602.05426】Multi-AD: Cross-Domain Unsupervised Anomaly Detection for Medical and Industrial Applications

链接https://arxiv.org/abs/2602.05426

作者:Wahyu Rahmaniar,Kenji Suzuki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Traditional deep learning, early disease diagnosis, Traditional deep, lack annotated data, anomaly detection

备注: 28 pages, 8 figures

点击查看摘要

Abstract:Traditional deep learning models often lack annotated data, especially in cross-domain applications such as anomaly detection, which is critical for early disease diagnosis in medicine and defect detection in industry. To address this challenge, we propose Multi-AD, a convolutional neural network (CNN) model for robust unsupervised anomaly detection across medical and industrial images. Our approach employs the squeeze-and-excitation (SE) block to enhance feature extraction via channel-wise attention, enabling the model to focus on the most relevant features and detect subtle anomalies. Knowledge distillation (KD) transfers informative features from the teacher to the student model, enabling effective learning of the differences between normal and anomalous data. Then, the discriminator network further enhances the model's capacity to distinguish between normal and anomalous data. At the inference stage, by integrating multi-scale features, the student model can detect anomalies of varying sizes. The teacher-student (T-S) architecture ensures consistent representation of high-dimensional features while adapting them to enhance anomaly detection. Multi-AD was evaluated on several medical datasets, including brain MRI, liver CT, and retina OCT, as well as industrial datasets, such as MVTec AD, demonstrating strong generalization across multiple domains. Experimental results demonstrated that our approach consistently outperformed state-of-the-art models, achieving the best average AUROC for both image-level (81.4% for medical and 99.6% for industrial) and pixel-level (97.0% for medical and 98.4% for industrial) tasks, making it effective for real-world applications.

76. 【2602.05423】NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks

链接https://arxiv.org/abs/2602.05423

作者:Pengcheng Chen,Yue Hu,Wenhao Li,Nicole M Gunderson,Andrew Feng,Zhenglong Sun,Peter Beerel,Eric J Seibel

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:feed-forward systems, modern dense, explicitly output, geometry prediction, view synthesis

备注

点击查看摘要

Abstract:In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly output the novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that can offer accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigated the common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing prestigious methods.

77. 【2602.05420】Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring

链接https://arxiv.org/abs/2602.05420

作者:Rui Sun,Yiwen Yang,Kaiyu Guo,Chen Jiang,Dongli Xu,Zhaonan Liu,Tan Pan,Limei Han,Xue Jiang,Wu Wei,Yuan Cheng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Accurate cell instance, digital pathology analysis, Accurate cell, foundational for digital, digital pathology

备注: 17 pages, 10 figures; ICLR 2026

点击查看摘要

Abstract:Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware COllaborative Coloring), an adjacency-aware framework based on the "divide and conquer" principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, "Explicit Marking" strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a "conflict set." Second, "Implicit Disambiguation" mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.

78. 【2602.05415】VMF-GOS: Geometry-guided virtual Outlier Synthesis for Long-Tailed OOD Detection

链接https://arxiv.org/abs/2602.05415

作者:Ningkang Peng,Qianfeng Yu,Yuhao Zhang,Yafei Liu,Xiaoqian Peng,Peirong Ma,Yi Chen,Peiheng Li,Yanhui Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly challenging task, tail classes leads, blurred decision boundaries, Million Tiny Images, feature space

备注

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection under long-tailed distributions is a highly challenging task because the scarcity of samples in tail classes leads to blurred decision boundaries in the feature space. Current state-of-the-art (sota) methods typically employ Outlier Exposure (OE) strategies, relying on large-scale real external datasets (such as 80 Million Tiny Images) to regularize the feature space. However, this dependence on external data often becomes infeasible in practical deployment due to high data acquisition costs and privacy sensitivity. To this end, we propose a novel data-free framework aimed at completely eliminating reliance on external datasets while maintaining superior detection performance. We introduce a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models statistical properties using the von Mises-Fisher (vMF) distribution on a hypersphere. Specifically, we locate a low-likelihood annulus in the feature space and perform directional sampling of virtual outliers in this region. Simultaneously, we introduce a new Dual-Granularity Semantic Loss (DGS) that utilizes contrastive learning to maximize the distinction between in-distribution (ID) features and these synthesized boundary outliers. Extensive experiments on benchmarks such as CIFAR-LT demonstrate that our method outperforms sota approaches that utilize external real images.

79. 【2602.05414】SBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions

链接https://arxiv.org/abs/2602.05414

作者:Ngoc Doan-Minh Huynh,Duong Nguyen-Ngoc Tran,Long Hoang Pham,Tai Huu-Phuong Tran,Hyung-Joon Jeon,Huy-Hung Nguyen,Duong Khac Vu,Hyung-Min Jeon,Son Hong Phan,Quoc Pham-Nam Ho,Chi Dai Tran,Trinh Le Ba Khanh,Jae Wook Jeon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:degrade CCTV signal, Global warming, degrade CCTV, CCTV signal, extreme weather events

备注: This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames; bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring, pave the way for new research and applications. The TSBOW dataset is publicly available at: this https URL.

80. 【2602.05397】Explainable Pathomics Feature Visualization via Correlation-aware Conditional Feature Editing

链接https://arxiv.org/abs/2602.05397

作者:Yuechen Yang,Junlin Guo,Ruining Deng,Junchao Zhu,Zhengyi Lu,Chongyu Qu,Yanfan Zhu,Xingyi Guo,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offers rich quantitative, black-box deep learning, rich quantitative features, learning can provide, supporting more reproducible

备注

点击查看摘要

Abstract:Pathomics is a recent approach that offers rich quantitative features beyond what black-box deep learning can provide, supporting more reproducible and explainable biomarkers in digital pathology. However, many derived features (e.g., "second-order moment") remain difficult to interpret, especially across different clinical contexts, which limits their practical adoption. Conditional diffusion models show promise for explainability through feature editing, but they typically assume feature independence**--**an assumption violated by intrinsically correlated pathomics features. Consequently, editing one feature while fixing others can push the model off the biological manifold and produce unrealistic artifacts. To address this, we propose a Manifold-Aware Diffusion (MAD) framework for controllable and biologically plausible cell nuclei editing. Unlike existing approaches, our method regularizes feature trajectories within a disentangled latent space learned by a variational auto-encoder (VAE). This ensures that manipulating a target feature automatically adjusts correlated attributes to remain within the learned distribution of real cells. These optimized features then guide a conditional diffusion model to synthesize high-fidelity images. Experiments demonstrate that our approach is able to navigate the manifold of pathomics features when editing those features. The proposed method outperforms baseline methods in conditional feature editing while preserving structural coherence.

81. 【2602.05391】Dataset Distillation via Relative Distribution Matching and Cognitive Heritage

链接https://arxiv.org/abs/2602.05391

作者:Qianxin Xia,Jiawei Du,Yuhan Zhang,Jielei Wang,Guoming Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly compact dataset, Dataset distillation seeks, optimizes synthetic images, seeks to synthesize, synthesize a highly

备注

点击查看摘要

Abstract:Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For the classification task that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching , a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.

82. 【2602.05387】Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning

链接https://arxiv.org/abs/2602.05387

作者:Zolnamar Dorjsembe,Hung-Yi Chen,Furen Xiao,Hsing-Kuo Pao

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:superior soft tissue, soft tissue contrast, electron density information, density information limits, ionizing radiation

备注

点击查看摘要

Abstract:MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI only planning but remains challenging due to nonlinear MRI-CT relationships and anatomical variability. We propose Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: this https URL.

83. 【2602.05384】Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting

链接https://arxiv.org/abs/2602.05384

作者:Hao Feng,Wei Shi,Ke Zhang,Xiang Fei,Lei Liao,Dingkang Yang,Yongkun Du,Xuecheng Wu,Jingqun Tang,Yang Liu,Hong Chen,Can Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advance OCR capabilities, garnered widespread attention, advance OCR, OCR capabilities, garnered widespread

备注

点击查看摘要

Abstract:Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.

84. 【2602.05382】VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs

链接https://arxiv.org/abs/2602.05382

作者:Tina Khezresmaeilzadeh,Jike Zhong,Konstantinos Psounis

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Vision Language Models, Vision Language, reliably perform nonverbal, Recent progress, progress in Vision

备注

点击查看摘要

Abstract:Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.

85. 【2602.05380】SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback

链接https://arxiv.org/abs/2602.05380

作者:Xiaoxuan He,Siming Fu,Wanli Li,Zhiyuan Li,Dacheng Yin,Kang Rong,Fengyun Rao,Bo Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Aligning diffusion models, Aligning diffusion, preferences remains challenging, textbf, diffusion models

备注

点击查看摘要

Abstract:Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. \textit{This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves?} In this paper, we propose \textbf{SAIL} (\textbf{S}elf-\textbf{A}mplified \textbf{I}terative \textbf{L}earning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6\% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.

86. 【2602.05375】Erase at the Core: Representation Unlearning for Machine Unlearning

链接https://arxiv.org/abs/2602.05375

作者:Jaewon Lee,Yongwoo Kim,Donghyun Kim

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:preserve substantial information, internal feature representations, approximate machine unlearning, demonstrate strong logit-level, methods demonstrate strong

备注

点击查看摘要

Abstract:Many approximate machine unlearning methods demonstrate strong logit-level forgetting -- such as near-zero accuracy on the forget set -- yet continue to preserve substantial information within their internal feature representations. We refer to this discrepancy as superficial forgetting. Recent studies indicate that most existing unlearning approaches primarily alter the final classifier, leaving intermediate representations largely unchanged and highly similar to those of the original model. To address this limitation, we introduce the Erase at the Core (EC), a framework designed to enforce forgetting throughout the entire network hierarchy. EC integrates multi-layer contrastive unlearning on the forget set with retain set preservation through deeply supervised learning. Concretely, EC attaches auxiliary modules to intermediate layers and applies both contrastive unlearning and cross-entropy losses at each supervision point, with layer-wise weighted losses. Experimental results show that EC not only achieves effective logit-level forgetting, but also substantially reduces representational similarity to the original model across intermediate layers. Furthermore, EC is model-agnostic and can be incorporated as a plug-in module into existing unlearning methods, improving representation-level forgetting while maintaining performance on the retain set.

87. 【2602.05362】Imagine a City: CityGenAgent for Procedural 3D City Generation

链接https://arxiv.org/abs/2602.05362

作者:Zishan Liu,Zecong Tang,RuoCheng Wu,Xinzhe Zheng,Jingyu Hu,Ka-Hei Hui,Haoran Xie,Bo Dai,Zhengzhe Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:virtual reality, autonomous driving, embodied intelligence, critical challenge, challenge with broad

备注

点击查看摘要

Abstract:The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models' generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.

88. 【2602.05360】Breaking Semantic Hegemony: Decoupling Principal and Residual Subspaces for Generalized OOD Detection

链接https://arxiv.org/abs/2602.05360

作者:Ningkang Peng,Xiaoqian Peng,Yuhao Zhang,Qianfeng Yu,Feng Xing,Peirong Ma,Xichen Yang,Yi Chen,Tingyu Lu,Yanhui Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:semantically subtle OOD, subtle OOD samples, semantically simple samples, severe Geometric Blindness, distinguishing semantically subtle

备注

点击查看摘要

Abstract:While feature-based post-hoc methods have made significant strides in Out-of-Distribution (OOD) detection, we uncover a counter-intuitive Simplicity Paradox in existing state-of-the-art (SOTA) models: these models exhibit keen sensitivity in distinguishing semantically subtle OOD samples but suffer from severe Geometric Blindness when confronting structurally distinct yet semantically simple samples or high-frequency sensor noise. We attribute this phenomenon to Semantic Hegemony within the deep feature space and reveal its mathematical essence through the lens of Neural Collapse. Theoretical analysis demonstrates that the spectral concentration bias, induced by the high variance of the principal subspace, numerically masks the structural distribution shift signals that should be significant in the residual subspace. To address this issue, we propose D-KNN, a training-free, plug-and-play geometric decoupling framework. This method utilizes orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model's sensitivity to weak residual signals. Extensive experiments demonstrate that D-KNN effectively breaks Semantic Hegemony, establishing new SOTA performance on both CIFAR and ImageNet benchmarks. Notably, in resolving the Simplicity Paradox, it reduces the FPR95 from 31.3% to 2.3%; when addressing sensor failures such as Gaussian noise, it boosts the detection performance (AUROC) from a baseline of 79.7% to 94.9%.

89. 【2602.05359】Multimodal Latent Reasoning via Hierarchical Visual Cues Injection

链接https://arxiv.org/abs/2602.05359

作者:Yiming Zhang,Qiangyu Yan,Borui Jiang,Kai Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:impressive perception capabilities, enabled impressive perception, multimodal large language, large language models, perception capabilities

备注

点击查看摘要

Abstract:The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (\emph{HIVE}), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.

90. 【2602.05349】Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection

链接https://arxiv.org/abs/2602.05349

作者:Ningkang Peng,JiuTao Zhou,Yuhao Zhang,Xiaoqian Peng,Qianfeng Yu,Linjing Qian,Tingyu Lu,Yi Chen,Yanhui Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:machine learning models, Static Homogeneity Assumption, real world, critical task, safe deployment

备注

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is a critical task for the safe deployment of machine learning models in the real world. Existing prototype-based representation learning methods have demonstrated exceptional performance. Specifically, we identify two fundamental flaws that universally constrain these methods: the Static Homogeneity Assumption (fixed representational resources for all classes) and the Learning-Inference Disconnect (discarding rich prototype quality knowledge at inference). These flaws fundamentally limit the model's capacity and performance. To address these issues, we propose APEX (Adaptive Prototype for eXtensive OOD Detection), a novel OOD detection framework designed via a Two-Stage Repair process to optimize the learned feature manifold. APEX introduces two key innovations to address these respective flaws: (1) an Adaptive Prototype Manifold (APM), which leverages the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity $K_c^*$ for each class, thereby fundamentally resolving prototype collision; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect. Comprehensive experiments on benchmarks such as CIFAR-100 validate the superiority of our method, where APEX achieves new state-of-the-art performance.

91. 【2602.05339】Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation

链接https://arxiv.org/abs/2602.05339

作者:Yongwoo Kim,Sungmin Cha,Hyunsoo Kim,Jaewon Lee,Donghyun Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:erase undesirable concepts, selectively erase undesirable, diffusion models, harmful content, increasing versatility

备注

点击查看摘要

Abstract:With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.

92. 【2602.05330】MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors

链接https://arxiv.org/abs/2602.05330

作者:Jingdong Zhang,Xiaohang Zhan,Lingzhi Zhang,Yizhou Wang,Zhengming Yu,Jionghao Wang,Wenping Wang,Xin Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Comprehensive panoramic scene, remains challenging due, panoramic scene understanding, Comprehensive panoramic, immersive applications

备注

点击查看摘要

Abstract:Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to fertilize the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.

93. 【2602.05321】Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning

链接https://arxiv.org/abs/2602.05321

作者:Dongki Jung,Jaehoon Choi,Adil Qureshi,Somi Jeong,Dinesh Manocha,Suyong Yeon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual geometry reconstruction, visual geometry, feed-forward neural network, Abstract, neural network

备注

点击查看摘要

Abstract:We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360 imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.

94. 【2602.05305】FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

链接https://arxiv.org/abs/2602.05305

作者:Zhuokun Chen,Jianfei Cai,Bohan Zhuang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Generating long-form content, Generating long-form, modern generative models, long-form content, extended texts

备注

点击查看摘要

Abstract:Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: this https URL.

95. 【2602.05293】Fast-SAM3D: 3Dfy Anything in Images but Faster

链接https://arxiv.org/abs/2602.05293

作者:Weilun Feng,Mingqiang Wu,Zhiliang Chen,Chuanguang Yang,Haotong Qin,Yuqi Li,Xiaokun Liu,Guoxin Fan,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:prohibitive inference latency, enables scalable, reconstruction from complex, complex scenes, textbf

备注

点击查看摘要

Abstract:SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the \textbf{first systematic investigation} into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level \textbf{heterogeneity}: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present \textbf{Fast-SAM3D}, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) \textit{Modality-Aware Step Caching} to decouple structural evolution from sensitive layout updates; (2) \textit{Joint Spatiotemporal Token Carving} to concentrate refinement on high-entropy regions; and (3) \textit{Spectral-Aware Token Aggregation} to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to \textbf{2.67$\times$} end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released in this https URL.

96. 【2602.05275】Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

链接https://arxiv.org/abs/2602.05275

作者:Qi Li,Yanzhe Zhao,Yongxin Zhou,Yameng Wang,Yandong Yang,Yuanjia Zhou,Jue Wang,Zuojian Wang,Jinxiang Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shown immense promise, find relevant items, Multimodal Large Language, Large Language Models, universal multimodal retrieval

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continue pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.

97. 【2602.05271】Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental Learning

链接https://arxiv.org/abs/2602.05271

作者:Shengqin Jiang,Xiaoran Feng,Yuankai Qi,Haokui Zhang,Renlong Hang,Qingshan Liu,Lina Yao,Quan Z. Sheng,Ming-Hsuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Few-shot class-incremental learning, previously acquired knowledge, preserving previously acquired, Few-shot class-incremental, class-incremental learning

备注: under review

点击查看摘要

Abstract:Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model's capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.

98. 【2602.05262】ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network

链接https://arxiv.org/abs/2602.05262

作者:Junzhou Li,Manqi Zhao,Yilin Gao,Zhiheng Yu,Yin Li,Dongsheng Jiang,Li Xiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Transformer-based architectures, Large Receptive Field, Balancing accuracy, textbf, Gated Modulated Attention

备注: 11 pages, 4 figures

点击查看摘要

Abstract:Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce \textbf{ReGLA}, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; particularly the ReGLA-M achieves \textbf{80.85\%} Top-1 accuracy on ImageNet-1K at $224px$, with only \textbf{4.98 ms} latency at $512px$. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of \textbf{3.1\%} AP on COCO object detection and \textbf{3.6\%} mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.

99. 【2602.05257】RFM-Pose:Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation

链接https://arxiv.org/abs/2602.05257

作者:Diya He,Qingchen Liu,Cong Zhang,Jiahu Qin

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:embodied intelligence, Object pose estimation, computer vision, vision and plays, plays a critical

备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score based generative models have to some extent solved the rotational symmetry ambiguity problem in category level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.

100. 【2602.05250】Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images

链接https://arxiv.org/abs/2602.05250

作者:Jieyun Tan,Shuo Liu,Guibin Zhang,Ziqi Li,Jian Geng,Lei Zhang,Lei Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:electron dense deposits, high-quality labeled data, Automated detection, dense deposits, labeled data

备注: 10 pages, 6 figures

点击查看摘要

Abstract:Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP\textsubscript{50} on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.

101. 【2602.05238】PatchFlow: Leveraging a Flow-Based Model with Patch Features

链接https://arxiv.org/abs/2602.05238

作者:Boxiang Zhang,Baijian Yang,Xiaoming Wang,Corey Vian

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:craft intricate shapes, Die casting plays, plays a crucial, crucial role, industries due

备注

点击查看摘要

Abstract:Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20\% on the MVTec AD dataset, achieving an image-level AUROC of 99.28\%. Our approach has also enhanced performance on the VisA dataset , achieving an image-level AUROC of 96.48\%. Compared to the state-of-the-art models, this represents a 28.2\% reduction in error. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77\% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry

102. 【2602.05218】Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification

链接https://arxiv.org/abs/2602.05218

作者:Jiahao Nie,Yun Xing,Wenbin An,Qingsong Zhao,Jiawei Shao,Yap-Peng Tan,Alex C. Kot,Shijian Lu,Xuelong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Segment Anything Model, recent studies leverage, studies leverage SAM, predict object masks, recent studies

备注

点击查看摘要

Abstract:Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.

103. 【2602.05217】Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation

链接https://arxiv.org/abs/2602.05217

作者:Jiahao Nie,Guanqiao Fu,Wenbin An,Yap-Peng Tan,Alex C. Kot,Shijian Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Cross-Domain Few-Shot Segmentation, Few-Shot Segmentation aims, Segmentation aims, Few-Shot Segmentation, data-scarce domains conditioned

备注

点击查看摘要

Abstract:Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).

104. 【2602.05215】E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching

链接https://arxiv.org/abs/2602.05215

作者:Jiahao Nie,Wenbin An,Gongjie Zhang,Yicheng Xu,Yap-Peng Tan,Alex C. Kot,Shijian Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Video Large Language, Temporal Video Grounding, Video Grounding, Language Models

备注

点击查看摘要

Abstract:Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose this http URL, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. this http URL introduces three key innovations: (i) a special event token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that this http URL consistently outperforms state-of-the-art Vid-LLMs by significant margins.

105. 【2602.05213】Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures

链接https://arxiv.org/abs/2602.05213

作者:Chuqin Zhou,Xiaoyue Ling,Yunuo Chen,Jincheng Dai,Guo Lu,Wenjun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recent neural codecs, effectiveness deteriorates significantly, ultra-low bitrate conditions, neural codecs achieve, codecs achieve strong

备注

点击查看摘要

Abstract:While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, We condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.

106. 【2602.05204】Extreme Weather Nowcasting via Local Precipitation Pattern Prediction

链接https://arxiv.org/abs/2602.05204

作者:Changhoon Song,Teng Yuan Chang,Youngjoon Hong

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:disaster mitigation, storms is critical, critical for risk, risk management, management and disaster

备注: 10pages, 20 figures, The Fourteenth International Conference on Learning Representations, see [this https URL](https://github.com/tony890048/exPreCast)

点击查看摘要

Abstract:Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons. While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed--either dominated by ordinary rainfall events or restricted to extreme rainfall episodes--thereby hindering general applicability in real-world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.

107. 【2602.05202】GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

链接https://arxiv.org/abs/2602.05202

作者:Shivanshu Shekhar,Uttaran Bhattacharya,Raghavendra Addanki,Mehrab Tanjim,Somdeb Sarkhel,Tong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Aligning video generative, human preferences remains, video generative models, preferences remains challenging, current approaches rely

备注

点击查看摘要

Abstract:Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (\modelname), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. \modelname achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human-annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.

108. 【2602.05190】PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction

链接https://arxiv.org/abs/2602.05190

作者:Ju Shen,Chen Chen,Tam V. Nguyen,Vijayan K. Asari

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:pose-guided Gaussian Splatting, view synthesis, Gaussian Splatting pipelines, Gaussian Splatting, Gaussian Splatting framework

备注

点击查看摘要

Abstract:We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).

109. 【2602.05175】ShapePuri: Shape Guided and Appearance Generalized Adversarial Purification

链接https://arxiv.org/abs/2602.05175

作者:Zhe Li,Bernhard Kainz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep neural networks, neural networks demonstrate, networks demonstrate impressive, demonstrate impressive performance, Deep neural

备注: 10 pages, 5 figures

点击查看摘要

Abstract:Deep neural networks demonstrate impressive performance in visual recognition, but they remain vulnerable to adversarial attacks that is imperceptible to the human. Although existing defense strategies such as adversarial training and purification have achieved progress, diffusion-based purification often involves high computational costs and information loss. To address these challenges, we introduce Shape Guided Purification (ShapePuri), a novel defense framework enhances robustness by aligning model representations with stable structural invariants. ShapePuri integrates two components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. In our experiments, ShapePuri achieves $84.06\%$ clean accuracy and $81.64\%$ robust accuracy under the AutoAttack protocol, representing the first defense framework to surpass the $80\%$ threshold on this benchmark. Our approach provides a scalable and efficient adversarial defense that preserves prediction stability during inference without requiring auxiliary modules or additional computational cost.

110. 【2602.05163】LOBSTgER-enhance: an underwater image enhancement pipeline

链接https://arxiv.org/abs/2602.05163

作者:Andreas Mentzelopoulos,Keith Ellenbogen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including reduced contrast, presents significant inherent, significant inherent challenges, inherent challenges including, challenges including reduced

备注: 12 pages, 30 figures, work done as part of LOBSTgER

点击查看摘要

Abstract:Underwater photography presents significant inherent challenges including reduced contrast, spatial blur, and wavelength-dependent color distortions. These effects can obscure the vibrancy of marine life and awareness photographers in particular are often challenged with heavy post-processing pipelines to correct for these distortions. We develop an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation. Training and evaluation are performed on a small high-quality dataset of awareness photography images by Keith Ellenbogen. The proposed methodology achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using a model of ~11M parameters after training from scratch on ~2.5k images.

Comments:
12 pages, 30 figures, work done as part of LOBSTgER

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.05163 [cs.CV]

(or
arXiv:2602.05163v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.05163

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
111. 【2602.05162】SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition

链接https://arxiv.org/abs/2602.05162

作者:Anay Majee,Rishabh Iyer

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Deep neural networks, Deep neural, sensitive attributes, leading to unfair, unfair predictions

备注: 21 pages, 7 tables, 10 figures

点击查看摘要

Abstract:Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes like race, age, gender etc. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize on sensitive attributes, worsening unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises of SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives - effectively mitigating data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in Accuracy, within fewer epochs as compared to existing methods.

112. 【2602.05159】AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves

链接https://arxiv.org/abs/2602.05159

作者:Wenhui Cui,Ziyi Kou,Chuan Qin,Ergys Ristani,Li Guan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:robotic policy learning, provide rich signals, acceleration and tactile, tactile feedback, important tools

备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:Sensing gloves have become important tools for teleoperation and robotic policy learning as they are able to provide rich signals like speed, acceleration and tactile feedback. A common approach to track gloved hands is to directly use the sensor signals (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice as the accuracy is often impacted by sensor signal and calibration quality. Recent advances in vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer from substantial performance degradation on sensing gloves due to large appearance gap between bare-hand and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations towards new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes the hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.

113. 【2602.05132】ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation

链接https://arxiv.org/abs/2602.05132

作者:Jia Li,Wenjie Zhao,Shijian Deng,Bolin Lai,Yuheng Wu,RUijia Chen,Jon E. Froehlich,Yuhang Zhao,Yapeng Tian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:assistive technologies, camera wearer, first-person video, task essential, essential for augmented

备注

点击查看摘要

Abstract:Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.

114. 【2602.05126】CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology

链接https://arxiv.org/abs/2602.05126

作者:Weiyi Qin,Yingci Liu-Swetz,Shiwei Tan,Hao Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human papillomavirus, Explainable Attention-guided Representation, cervical cancers, critical determinant, determinant of prognosis

备注

点击查看摘要

Abstract:Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.

115. 【2602.05100】Rule-Based Spatial Mixture-of-Experts U-Net for Explainable Edge Detection

链接https://arxiv.org/abs/2602.05100

作者:Bharadwaj Dogga,Kaaustaaub Shankar,Gibin Raju,Wilhelm Louw,Kelly Cohen

类目:Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)

关键词:Generative AI services, image generation models, edge detection tasks, detection tasks, services world-wide

备注

点击查看摘要

Abstract:Deep learning models like U-Net and its variants, have established state-of-the-art performance in edge detection tasks and are used by Generative AI services world-wide for their image generation models. However, their decision-making processes remain opaque, operating as "black boxes" that obscure the rationale behind specific boundary predictions. This lack of transparency is a critical barrier in safety-critical applications where verification is mandatory. To bridge the gap between high-performance deep learning and interpretable logic, we propose the Rule-Based Spatial Mixture-of-Experts U-Net (sMoE U-Net). Our architecture introduces two key innovations: (1) Spatially-Adaptive Mixture-of-Experts (sMoE) blocks integrated into the decoder skip connections, which dynamically gate between "Context" (smooth) and "Boundary" (sharp) experts based on local feature statistics; and (2) a Takagi-Sugeno-Kang (TSK) Fuzzy Head that replaces the standard classification layer. This fuzzy head fuses deep semantic features with heuristic edge signals using explicit IF-THEN rules. We evaluate our method on the BSDS500 benchmark, achieving an Optimal Dataset Scale (ODS) F-score of 0.7628, effectively matching purely deep baselines like HED (0.7688) while outperforming the standard U-Net (0.7437). Crucially, our model provides pixel-level explainability through "Rule Firing Maps" and "Strategy Maps," allowing users to visualize whether an edge was detected due to strong gradients, high semantic confidence, or specific logical rule combinations.

116. 【2602.05096】Visual concept ranking uncovers medical shortcuts used by large multimodal models

链接https://arxiv.org/abs/2602.05096

作者:Joseph D. Janizek,Sonnet Xu,Junayd Lateef,Roxana Daneshjou

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:healthcare requires auditing, Ensuring the reliability, uncover model shortcomings, requires auditing methods, machine learning models

备注

点击查看摘要

Abstract:Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.

117. 【2602.05081】Gabor Fields: Orientation-Selective Level-of-Detail for Volume Rendering

链接https://arxiv.org/abs/2602.05081

作者:Jorge Condor,Nicolai Hermann,Mehmet Ata Yurtsever,Piotr Didyk

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:enabled efficient physically-based, voxel-based distributions, physically-based volume rendering, efficient physically-based volume, physically-based volume

备注: 19 pages, incl Appendix and References

点击查看摘要

Abstract:Gaussian-based representations have enabled efficient physically-based volume rendering at a fraction of the memory cost of regular, discrete, voxel-based distributions. However, several remaining issues hamper their widespread use. One of the advantages of classic voxel grids is the ease of constructing hierarchical representations by either storing volumetric mipmaps or selectively pruning branches of an already hierarchical voxel grid. Such strategies reduce rendering time and eliminate aliasing when lower levels of detail are required. Constructing similar strategies for Gaussian-based volumes is not trivial. Straightforward solutions, such as prefiltering or computing mipmap-style representations, lead to increased memory requirements or expensive re-fitting of each level separately. Additionally, such solutions do not guarantee a smooth transition between different hierarchy levels. To address these limitations, we propose Gabor Fields, an orientation-selective mixture of Gabor kernels that enables continuous frequency filtering at no cost. The frequency content of the asset is reduced by selectively pruning primitives, directly benefiting rendering performance. Beyond filtering, we demonstrate that stochastically sampling from different frequencies and orientations at each ray recursion enables masking substantial portions of the volume, accelerating ray traversal time in single- and multiple-scattering settings. Furthermore, inspired by procedural volumes, we present an application for efficient design and rendering of procedural clouds as Gabor-noise-modulated Gaussians.

118. 【2602.05078】Food Portion Estimation: From Pixels to Calories

链接https://arxiv.org/abs/2602.05078

作者:Gautham Vinod,Fengqing Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

关键词:image-based dietary assessment, individual health, diseases and obesity, dietary assessment suffers, dietary assessment

备注

点击查看摘要

Abstract:Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment suffers from estimating the three dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by either using monocular images or combinations of the image and the auxillary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.

119. 【2602.05049】VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models

链接https://arxiv.org/abs/2602.05049

作者:Yiye Chen,Yanan Jian,Xiaoyi Dong,Shuxin Cao,Jing Wu,Patricio Vela,Benjamin E. Lundell,Dongdong Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:demonstrated strong performance, robotic manipulation tasks, demonstrated strong, wide range, range of robotic

备注: In submission. Project website: [this https URL](https://vista-vla.github.io/)

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: this https URL .

120. 【2602.05037】UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking

链接https://arxiv.org/abs/2602.05037

作者:Bishoy Galoaa,Xiangyu Bai,Utsav Nandi,Sai Siddhartha Vivek Dhir Rangoju,Somaieh Amraee,Sarah Ostadabbas

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:significantly enhance multi-object, directly optimizing tracking-specific, graph-theoretic loss function, optimizing tracking-specific objectives, loss function designed

备注

点击查看摘要

Abstract:We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE. Extensive evaluations show up to 53\% reduction in identity switches and 12\% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7\% MOTA on SportsMOT.

121. 【2602.05029】Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping

链接https://arxiv.org/abs/2602.05029

作者:Octavio Arriaga,Proneet Sharma,Jichen Guo,Marc Otto,Siddhant Kadwe,Rebecca Adam

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:requires robotic systems, Operating effectively, real-world environments requires, environments requires robotic, previously unseen objects

备注: Submitted to IEEE Robotics and Automation Letters (RA-L) for review. This version includes the statement required by IEEE for preprints

点击查看摘要

Abstract:Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically-consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data efficient, interpretable and generalizable robot autonomy in novel environments.

122. 【2602.05013】Untwisting RoPE: Frequency Control for Shared Attention in DiTs

链接https://arxiv.org/abs/2602.05013

作者:Aryan Mikaeili,Or Patashnik,Andrea Tagliasacchi,Daniel Cohen-Or,Ali Mahdavi-Amiri

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Rotary Positional Embeddings, fully understood, encodings are essential, multimodal and attention-sharing, attention-sharing settings

备注

点击查看摘要

Abstract:Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities. We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior. Building on these insights, we introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.

123. 【2602.04994】SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

链接https://arxiv.org/abs/2602.04994

作者:Zhuosen Bao,Xia Du,Zheng Lin,Jizhe Zhou,Zihan Fang,Jiening Wu,Yuxin Zhang,Zhe Chen,Chi-man Pun,Wei Ni,Jun Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:achieving effective decoupling, face privacy protection, privacy protection, online banking, networked services

备注: 14 pages, 8 figures

点击查看摘要

Abstract:With the deep integration of facial recognition into online banking, identity verification, and other networked services, achieving effective decoupling of identity information from visual representations during image storage and transmission has become a critical challenge for privacy protection. To address this issue, we propose SIDeR, a Semantic decoupling-driven framework for unrestricted face privacy protection. SIDeR decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. By leveraging semantic-guided recomposition in the latent space of a diffusion model, it generates visually anonymous adversarial faces while maintaining machine-level identity consistency. The framework incorporates momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor to synthesize multiple visually diverse, highly natural adversarial samples. Furthermore, for authorized access, the protected image can be restored to its original form when the correct password is provided. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that SIDeR achieves a 99% attack success rate in black-box scenarios and outperforms baseline methods by 41.28% in PSNR-based restoration quality.

124. 【2602.04908】mporal Pair Consistency for Variance-Reduced Flow Matching

链接https://arxiv.org/abs/2602.04908

作者:Chika Maduabuchi,Jindong Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Continuous-time generative models, learn time-dependent vector, time-dependent vector fields, treat timesteps independently, Continuous-time generative

备注

点击查看摘要

Abstract:Continuous-time generative models, such as diffusion models, flow matching, and rectified flow, learn time-dependent vector fields but are typically trained with objectives that treat timesteps independently, leading to high estimator variance and inefficient sampling. Prior approaches mitigate this via explicit smoothness penalties, trajectory regularization, or modified probability paths and solvers. We introduce Temporal Pair Consistency (TPC), a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path, operating entirely at the estimator level without modifying the model architecture, probability path, or solver. We provide a theoretical analysis showing that TPC induces a quadratic, trajectory-coupled regularization that provably reduces gradient variance while preserving the underlying flow-matching objective. Instantiated within flow matching, TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions, achieving lower FID at identical or lower computational cost than prior methods, and extends seamlessly to modern SOTA-style pipelines with noise-augmented training, score-based denoising, and rectified flow.

125. 【2602.05738】Disc-Centric Contrastive Learning for Lumbar Spine Severity Grading

链接https://arxiv.org/abs/2602.05738

作者:Sajjan Acharya,Pralisha Kansakar

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:automated severity grading, work examines, examines a disc-centric, disc-centric approach, approach for automated

备注

点击查看摘要

Abstract:This work examines a disc-centric approach for automated severity grading of lumbar spinal stenosis from sagittal T2-weighted MRI. The method combines contrastive pretraining with disc-level fine-tuning, using a single anatomically localized region of interest per intervertebral disc. Contrastive learning is employed to help the model focus on meaningful disc features and reduce sensitivity to irrelevant differences in image appearance. The framework includes an auxiliary regression task for disc localization and applies weighted focal loss to address class imbalance. Experiments demonstrate a 78.1% balanced accuracy and a reduced severe-to-normal misclassification rate of 2.13% compared with supervised training from scratch. Detecting discs with moderate severity can still be challenging, but focusing on disc-level features provides a practical way to assess the lumbar spinal stenosis.

126. 【2602.05453】owards Segmenting the Invisible: An End-to-End Registration and Segmentation Framework for Weakly Supervised Tumour Analysis

链接https://arxiv.org/abs/2602.05453

作者:Budhaditya Mukhopadhyay,Chirag Mandal,Pavan Tummala,Naghmeh Mahmoodian,Andreas Nürnberger,Soumick Chatterjee

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

关键词:pre-operative MRI, Liver tumour ablation, tumour ablation presents, intra-operative CT due, due to minimal

备注: Accepted for AIBio at ECAI 2025

点击查看摘要

Abstract:Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre-operative MRI, they are often effectively invisible on intra-operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross-modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration-segmentation framework that combines MSCGUNet for inter-modal image registration with a UNet-based segmentation module, enabling registration-assisted pseudo-label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the "domain gap" and "feature absence" problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration-based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross-modality medical image analysis. Code an weights are available at: this https URL

127. 【2602.05208】Context-Aware Asymmetric Ensembling for Interpretable Retinopathy of Prematurity Screening via Active Query and Vascular Attention

链接https://arxiv.org/abs/2602.05208

作者:Md. Mehedi Hassan,Taufiq Hasan

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Retinopathy of Prematurity, preventable childhood blindness, childhood blindness, preventable childhood, Asymmetric Ensemble Model

备注: 16 pages, 6 figures

点击查看摘要

Abstract:Retinopathy of Prematurity (ROP) is among the major causes of preventable childhood blindness. Automated screening remains challenging, primarily due to limited data availability and the complex condition involving both structural staging and microvascular abnormalities. Current deep learning models depend heavily on large private datasets and passive multimodal fusion, which commonly fail to generalize on small, imbalanced public cohorts. We thus propose the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) that simulates clinical reasoning through two specialized streams. First, the Multi-Scale Active Query Network (MS-AQNet) serves as a structure specialist, utilizing clinical contexts as dynamic query vectors to spatially control visual feature extraction for localization of the fibrovascular ridge. Secondly, VascuMIL encodes Vascular Topology Maps (VMAP) within a gated Multiple Instance Learning (MIL) network to precisely identify vascular tortuosity. A synergistic meta-learner ensembles these orthogonal signals to resolve diagnostic discordance across multiple objectives. Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework attained State-of-the-Art performance on two distinct clinical tasks: achieving a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection. Crucially, the system features `Glass Box' transparency through counterfactual attention heatmaps and vascular threat maps, proving that clinical metadata dictates the model's visual search. Additionally, this study demonstrates that architectural inductive bias can serve as an effective bridge for the medical AI data gap.

128. 【2602.05047】QuantumGS: Quantum Encoding Framework for Gaussian Splatting

链接https://arxiv.org/abs/2602.05047

作者:Grzegorz Wilczyński,Rafał Tobiasz,Paweł Gora,Marcin Mazur,Przemysław Spurek

类目:Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

关键词:enabled real-time rendering, Recent advances, Gaussian Splatting, Direction Gaussian Splatting, complex scenes

备注

点击查看摘要

Abstract:Recent advances in neural rendering, particularly 3D Gaussian Splatting (3DGS), have enabled real-time rendering of complex scenes. However, standard 3DGS relies on spherical harmonics, which often struggle to accurately capture high-frequency view-dependent effects such as sharp reflections and transparency. While hybrid approaches like Viewing Direction Gaussian Splatting (VDGS) mitigate this limitation using classical Multi-Layer Perceptrons (MLPs), they remain limited by the expressivity of classical networks in low-parameter regimes. In this paper, we introduce QuantumGS, a novel hybrid framework that integrates Variational Quantum Circuits (VQC) into the Gaussian Splatting pipeline. We propose a unique encoding strategy that maps the viewing direction directly onto the Bloch sphere, leveraging the natural geometry of qubits to represent 3D directional data. By replacing classical color-modulating networks with quantum circuits generated via a hypernetwork or conditioning mechanism, we achieve higher expressivity and better generalization. Source code is available in the supplementary material. Code is available at this https URL

129. 【2602.04890】A General-Purpose Diversified 2D Seismic Image Dataset from NAMSS

链接https://arxiv.org/abs/2602.04890

作者:Lucas de Magalhães Araujo,Otávio Oliveira Napoli,Sandra Avila,Edson Borin

类目:Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:geographically distributed collection, support modern machine, seismic sections designed, Marine Seismic Surveys, modern machine learning

备注

点击查看摘要

Abstract:We introduce the Unicamp-NAMSS dataset, a large, diverse, and geographically distributed collection of migrated 2D seismic sections designed to support modern machine learning research in geophysics. We constructed the dataset from the National Archive of Marine Seismic Surveys (NAMSS), which contains decades of publicly available marine seismic data acquired across multiple regions, acquisition conditions, and geological settings. After a comprehensive collection and filtering process, we obtained 2588 cleaned and standardized seismic sections from 122 survey areas, covering a wide range of vertical and horizontal sampling characteristics. To ensure reliable experimentation, we balanced the dataset so that no survey dominates the distribution, and partitioned it into non-overlapping macro-regions for training, validation, and testing. This region-disjoint split allows robust evaluation of generalization to unseen geological and acquisition conditions. We validated the dataset through quantitative and embedding-space analyses using both convolutional and transformer-based models. These analyses showed that Unicamp-NAMSS exhibits substantial variability within and across regions, while maintaining coherent structure across acquisition macro-region and survey types. Comparisons with widely used interpretation datasets (Parihaka and F3 Block) further demonstrated that Unicamp-NAMSS covers a broader portion of the seismic appearance space, making it a strong candidate for machine learning model pretraining. The dataset, therefore, provides a valuable resource for machine learning tasks, including self-supervised representation learning, transfer learning, benchmarking supervised tasks such as super-resolution or attribute prediction, and studying domain adaptation in seismic interpretation.

Subjects:

Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2602.04890 [physics.geo-ph]

(or
arXiv:2602.04890v1 [physics.geo-ph] for this version)

https://doi.org/10.48550/arXiv.2602.04890

Focus to learn more

              arXiv-issued DOI via DataCite</p>