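This blog post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

1255 papers were updated today, including:

  • Natural language processing: 166
  • Information retrieval: 44
  • Computer vision: 252

Natural Language Processing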

1. [2602.09012] Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense

Link: https://arxiv.org/abs/2602.09012

Authors: Jiacheng Liu, Yaxin Luo, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao, Zhiqiang Shen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: traditional CAPTCHAs obsolete, rendered traditional CAPTCHAs, rapid evolution, evolution of GUI-enabled, rendered traditional

Notes: Project page at [this https URL](https://greenoso.github.io/NextGen-CAPTCHAs_webpage/)

Abstract:The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh, have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like "Bingo". In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations. Notably, for backend-supported types, our system can generate effectively unbounded CAPTCHA instances. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.

2. [2602.09003] Data Science and Technology Towards AGI Part I: Tiered Data Management

Link: https://arxiv.org/abs/2602.09003

Authors: Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang, Jie Zhou, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: continuously driving advances, data, data management, utilization continuously driving, data-driven learning paradigms

Notes: 16 pages, 3 figures, 7 tables

Abstract:The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.

3. [2602.08997] Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs

Link: https://arxiv.org/abs/2602.08997

Authors: Lavender Y. Jiang, Xujin Chris Liu, Kyunghyun Cho, Eric K. Oermann

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Keywords: HIPAA Safe Harbor, Safe Harbor, sustains patient-provider trust, HIPAA Safe, sustains patient-provider

Abstract:Privacy is a human right that sustains patient-provider trust. Clinical notes capture a patient's private vulnerability and individuality, which are used for care coordination and research. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate it empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This position paper raises the question of how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.

4. [2602.08995] When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Link: https://arxiv.org/abs/2602.08995

Authors: Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun

Categories: Computation and Language (cs.CL)

Keywords: user original intent, made tremendous progress, Computer-use agents, frequently produce misaligned, produce misaligned actions

Notes: Project Homepage: [this https URL](https://osu-nlp-group.github.io/Misaligned-Action-Detection/)

Abstract:Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.

5. [2602.08984] Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models

Link: https://arxiv.org/abs/2602.08984

Authors: Yuliang Liu, Yunchong Song, Yixuan Wang, Kewen Ge, Alex Lamb, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Token Prediction, Concept Prediction, Prediction, generative pretraining paradigm, pretraining paradigm built

Abstract:We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, thereby forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. It leverages both NCP and NTP to drive parameter updates and generates a concept to guide the generation of the following tokens. We train ConceptLM from scratch at scales ranging from 70M to 1.5B parameters with up to 300B training data, including Pythia and GPT-2 backbones. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP leads to more powerful language models by introducing a harder pretraining task, providing a promising path toward better language modeling.
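The abstract does not spell out how ConceptLM quantizes hidden states, so the following is only a generic sketch of vector quantization: each hidden state is mapped to the index of its nearest codebook entry, and that index plays the role of a discrete "concept" ID. The function name `quantize`, the codebook, and the hidden-state values are all hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`.

    Distance is squared Euclidean; this is a generic VQ lookup,
    not the paper's actual quantizer (an assumption for illustration).
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Hypothetical 2-D codebook with three "concept" entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical hidden states, one per position.
hidden_states = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

# Each hidden state becomes a discrete concept ID.
concept_ids = [quantize(h, codebook) for h in hidden_states]
```

In a trained model the codebook would be learned jointly with the network; here it is fixed by hand to keep the lookup step visible.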

6. [2602.08979] Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Link: https://arxiv.org/abs/2602.08979

Authors: Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: automatically segmenting long-form, segmenting long-form audio, coherent sections, navigating podcasts, task of automatically

Abstract:Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.

7. [2602.08964] A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Link: https://arxiv.org/abs/2602.08964

Authors: Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: reliably attributing goals, predict its behaviour, agentic systems, explain and predict, established methodology

Abstract:Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

8. [2602.08951] How Should We Model the Probability of a Language?

Link: https://arxiv.org/abs/2602.08951

Authors: Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

Categories: Computation and Language (cs.CL)

Keywords: commercial language identification, written form, reliably identify, hundred in written, commercial language

Notes: Accepted for Vardial 2026

Abstract:Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
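To make the role of the prior concrete, here is a minimal sketch of Bayes' rule applied to language identification. It is an illustration only, not the authors' method: the language names, likelihoods, and priors are invented, and `posterior` is a hypothetical helper.

```python
def posterior(likelihoods, prior):
    """Combine P(text | lang) with a prior P(lang) via Bayes' rule.

    Both arguments are dicts keyed by language; languages missing
    from the prior are treated as having zero prior mass.
    """
    unnorm = {lang: likelihoods[lang] * prior.get(lang, 0.0) for lang in likelihoods}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("prior assigns zero mass to every candidate language")
    return {lang: value / total for lang, value in unnorm.items()}


# Hypothetical likelihoods for a short text that fits two related languages.
likelihoods = {"lang_a": 0.40, "lang_b": 0.60}

# A fixed global prior versus a prior informed by where the text was found.
global_prior = {"lang_a": 0.01, "lang_b": 0.99}
local_prior = {"lang_a": 0.80, "lang_b": 0.20}

global_post = posterior(likelihoods, global_prior)
local_post = posterior(likelihoods, local_prior)
```

With the same likelihoods, the global prior makes `lang_b` the near-certain answer, while the local prior flips the decision to `lang_a`. The text evidence is unchanged; only the assumed environment differs.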

9. [2602.08948] CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute

Link: https://arxiv.org/abs/2602.08948

Authors: Chen Jin, Ryutaro Tanno, Tom Diethe, Philip Teare

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, incurs substantial compute, Large Language, Language Models, boost reasoning accuracy

Abstract:Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.

10. [2602.08945] GitSearch: Enhancing Community Notes Generation with Gap-Informed Targeted Search

Link: https://arxiv.org/abs/2602.08945

Authors: Sahajpreet Singh, Kokil Jaidka, Min-Yen Kan

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Community-based moderation offers, faces significant structural, existing AI-based methods, AI-based methods fail, significant structural challenges

Notes: 18 pages, 11 figures, 7 tables

Abstract:Community-based moderation offers a scalable alternative to centralized fact-checking, yet it faces significant structural challenges, and existing AI-based methods fail in "cold start" scenarios. To tackle these challenges, we introduce GitSearch (Gap-Informed Targeted Search), a framework that treats human-perceived quality gaps, such as missing context, as first-class signals. GitSearch has a three-stage pipeline: identifying information deficits, executing real-time targeted web-retrieval to resolve them, and synthesizing platform-compliant notes. To facilitate evaluation, we present PolBench, a benchmark of 78,698 U.S. political tweets with their associated Community Notes. We find GitSearch achieves 99% coverage, almost doubling coverage over the state-of-the-art. GitSearch surpasses human-authored helpful notes with a 69% win rate and superior helpfulness scores (3.87 vs. 3.36), demonstrating retrieval effectiveness that balances the trade-off between scale and quality.

11. [2602.08874] Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Link: https://arxiv.org/abs/2602.08874

Authors: Yu Fu, Haz Sameen Shahgir, Huanli Gong, Zhipeng Wei, N. Benjamin Erichson, Yue Dong

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Large language models, synthesize information distributed, Large language, increasingly combine long-context, combine long-context processing

Notes: 25 pages, 7 figures

Abstract:Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.

12. [2602.08872] Large Language Models for Geolocation Extraction in Humanitarian Crisis Response

Link: https://arxiv.org/abs/2602.08872

Authors: G. Cafferata, T. Demarco, K. Kalimeri, Y. Mejova, M.G. Beiró

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: crises demand timely, Humanitarian crises demand, effective response efforts, inform effective response, Large Language Models

Abstract:Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.

13. [2602.08864] Understanding Dynamic Compute Allocation in Recurrent Transformers

Link: https://arxiv.org/abs/2602.08864

Authors: Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: reduce inference cost, seeks to reduce, reduce inference, inference cost, harder tokens

Abstract:Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.

14. [2602.08857] Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Link: https://arxiv.org/abs/2602.08857

Authors: Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent work, RASP programs, Transformers, simple RASP programs, RASP

Notes: 101 pages, 92 figures

Abstract:Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

15. [2602.08829] WildReward: Learning Reward Models from In-the-Wild Human Interactions

Link: https://arxiv.org/abs/2602.08829

Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large-scale human-annotated preference, large language models, human-annotated preference pairs, large language, typically rely

Abstract:Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances for training WildReward via ordinal regression directly on user feedback without preference pairs. Extensive experiments demonstrate that WildReward achieves comparable or even superior performance compared to conventional reward models, with improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity, where more users yield stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at this https URL.

16. [2602.08826] Affective Flow Language Model for Emotional Support Conversation

Link: https://arxiv.org/abs/2602.08826

Authors: Chenghui Zou, Ning Wang, Tiesunlong Shen, Luwei Xiao, Chuan Ma, Xiangpeng Li, Rui Mao, Erik Cambria

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, emotional support conversation, widely applied, support conversation

Notes: 19 pages, 7 figures

Abstract:Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging. This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experiment results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LMMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at this https URL.

17. [2602.08819] Bayesian Preference Learning for Test-Time Steerable Reward Models

Link: https://arxiv.org/abs/2602.08819

Authors: Jiwoo Hong, Shao Tang, Zhipeng Wang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: aligning language models, reinforcement learning, central to aligning, aligning language, Reward Modeling

Notes: Preprint

Abstract:Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
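For readers unfamiliar with the two ingredients named in the abstract, the sketch below shows the Bradley-Terry preference probability and a textbook Beta-Bernoulli conjugate update. It is not ICRM itself, which uses amortized variational inference; the function names and the example counts are hypothetical.

```python
import math


def bradley_terry(r_a, r_b):
    """P(A is preferred over B) for scalar rewards under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def beta_update(alpha, beta, wins, losses):
    """Conjugate update of a Beta(alpha, beta) prior on a preference probability.

    `wins` counts demonstrations where A was preferred, `losses` the rest.
    """
    return alpha + wins, beta + losses


def beta_mean(alpha, beta):
    """Posterior mean of the preference probability."""
    return alpha / (alpha + beta)


# Start from a uniform Beta(1, 1) prior, then observe four hypothetical
# demonstrations in which A was preferred three times.
alpha, beta = beta_update(1.0, 1.0, wins=3, losses=1)
mean = beta_mean(alpha, beta)
```

The point of conjugacy is that each extra demonstration only increments a count, which is one way to see why more in-context examples can sharpen the estimate without retraining.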

18. [2602.08796] The Use of AI Tools to Develop and Validate Q-Matrices

Link: https://arxiv.org/abs/2602.08796

Authors: Kevin Fan, Jacquelyn A. Bialo, Hongli Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: cognitive diagnostic modeling, validated Q-matrix, diagnostic modeling, Q-matrix, critical but labor-intensive

Notes: An earlier version of this study was presented at the Psychometric Society Meeting held in July 2025 in Minneapolis, USA

Abstract:Constructing a Q-matrix is a critical but labor-intensive step in cognitive diagnostic modeling (CDM). This study investigates whether AI tools (i.e., general language models) can support Q-matrix development by comparing AI-generated Q-matrices with a validated Q-matrix from Li and Suen (2013) for a reading comprehension test. In May 2025, multiple AI models were provided with the same training materials as human experts. Agreement among AI-generated Q-matrices, the validated Q-matrix, and human raters' Q-matrices was assessed using Cohen's kappa. Results showed substantial variation across AI models, with Google Gemini 2.5 Pro achieving the highest agreement (Kappa = 0.63) with the validated Q-matrix, exceeding that of all human experts. A follow-up analysis in January 2026 using newer AI versions, however, revealed lower agreement with the validated Q-matrix. Implications and directions for future research are discussed.
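Since the study reports agreement as Cohen's kappa, a small sketch of that statistic may help. The two Q-matrices below are invented for illustration (4 items by 3 attributes, flattened row by row) and have no connection to the study's data; `cohens_kappa` is a hypothetical helper.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of cells where the two raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal frequencies per label.
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary Q-matrices, flattened row by row.
reference = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]
candidate = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1]

kappa = cohens_kappa(reference, candidate)
```

Here the raters agree on 10 of 12 cells, but kappa comes out lower than the raw 0.83 agreement because a good share of that agreement would be expected by chance.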

19. [2602.08793] LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation

Link: https://arxiv.org/abs/2602.08793

Authors: Yushi Sun, Xujia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: Column type annotation, data lake, vital for tasks, data, Column type

Abstract:Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake to minimize the annotations required on the new data lake. However, several challenges exist, including the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.

20. [2602.08783] Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

Link: https://arxiv.org/abs/2602.08783

Authors: Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: methods replace explicit, internal latent steps, replace explicit textual, explicit textual rationales, modeling latent steps

Notes: 22 pages

Abstract:Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise $\mathrm{do}$-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how does influence propagate across steps, and how does this structure compare to explicit CoT; and (3) do intermediate trajectories retain competing answer modes, and how does output-level commitment differ from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses -- and corresponding training/decoding objectives -- as more reliable tools for interpreting and improving latent reasoning systems.

21. 【2602.08740】Map of Encoders -- Mapping Sentence Encoders using Quantum Relative Entropy

Link: https://arxiv.org/abs/2602.08740

Authors: Gaifan Zhang, Danushka Bollegala

Subjects: Computation and Language (cs.CL)

Keywords: sentence, sentence encoders, sentence encoder, visualise sentence encoders, encoders

Comments

Click to view abstract

Abstract:We propose a method to compare and visualise sentence encoders at scale by creating a map of encoders where each sentence encoder is represented in relation to the other sentence encoders. Specifically, we first represent a sentence encoder using an embedding matrix of a sentence set, where each row corresponds to the embedding of a sentence. Next, we compute the Pairwise Inner Product (PIP) matrix for a sentence encoder using its embedding matrix. Finally, we create a feature vector for each sentence encoder reflecting its Quantum Relative Entropy (QRE) with respect to a unit base encoder. We construct a map of encoders covering 1101 publicly available sentence encoders, providing a new perspective of the landscape of the pre-trained sentence encoders. Our map accurately reflects various relationships between encoders, where encoders with similar attributes are proximally located on the map. Moreover, our encoder feature vectors can be used to accurately infer downstream task performance of the encoders, such as in retrieval and clustering tasks, demonstrating the faithfulness of our map.
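
For orientation, the two quantities named in the abstract have standard textbook definitions, written out below. The symbols $E$, $\rho$, and $\sigma$ are introduced here for illustration, and normalising the PIP matrix to unit trace so that it can act as a density matrix is an assumption of this sketch; the abstract does not state how the paper performs that step.

$$
\mathrm{PIP}(E) = E E^{\top}, \qquad
\rho = \frac{\mathrm{PIP}(E)}{\operatorname{tr}\,\mathrm{PIP}(E)}, \qquad
S(\rho \,\|\, \sigma) = \operatorname{tr}\bigl[\rho\,(\log \rho - \log \sigma)\bigr]
$$

Here $E \in \mathbb{R}^{n \times d}$ holds one $d$-dimensional embedding per row for $n$ sentences, and $\sigma$ is the corresponding matrix for the base encoder. The relative entropy is finite only when the support of $\rho$ lies within the support of $\sigma$.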

22. 【2602.08716】PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments

Link: https://arxiv.org/abs/2602.08716

Authors: Shangrui Nie, Kian Omoomi, Lucie Flek, Zhixue Zhao, Charles Welch

Subjects: Computation and Language (cs.CL)

Keywords: developing large language, reflect human heterogeneity, faithfully reflect human, large language models, capacity to engage

Comments: 15 pages, 1 figure

Click to view abstract

Abstract:Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity. Yet this characteristic has not been carefully examined in the LLM research community and remains absent from most alignment studies. Debate-oriented sources provide a natural entry point for pluralism research. Previous work builds on online debate sources but remains constrained by costly human validation. Other debate-rich platforms such as Reddit and Kialo also offer promising material: Reddit provides linguistic diversity and scale but lacks clear argumentative structure, while Kialo supplies explicit pro/con graphs but remains overly concise and detached from natural discourse. We introduce PERSPECTRA, a pluralist benchmark that integrates the structural clarity of Kialo debate graphs with the linguistic diversity of real Reddit discussions. Using a controlled retrieval-and-expansion pipeline, we construct 3,810 enriched arguments spanning 762 pro/con stances on 100 controversial topics. Each opinion is expanded to multiple naturalistic variants, enabling robust evaluation of pluralism. We initialise three tasks with PERSPECTRA: opinion counting (identifying distinct viewpoints), opinion matching (aligning supporting stances and discourse to source opinions), and polarity check (inferring aggregate stance in mixed discourse). Experiments with state-of-the-art open-source and proprietary LLMs highlight systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, underscoring the difficulty of pluralism-aware understanding and reasoning. By combining diversity with structure, PERSPECTRA establishes the first scalable, configurable benchmark for evaluating how well models represent, distinguish, and reason over multiple perspectives.

23. 【2602.08709】FactSim: Fact-Checking for Opinion Summarization

Link: https://arxiv.org/abs/2602.08709

Authors: Leandro Anghinoni, Jorge Sanchez

Subjects: Computation and Language (cs.CL)

Keywords: generative artificial intelligence, precise evaluation techniques, text summarization tasks, summarization tasks, artificial intelligence

Comments: 10 pages, 4 figures

Click to view abstract

Abstract:We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries from a collection of opinion pieces, e.g. product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLM). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method is based on measuring the similarity between the claims in a given summary with those from the original reviews, measuring the coverage and consistency of the generated summary. To do so, we rely on a simple approach to extract factual assessment from texts that we then compare and summarize in a suitable score. We demonstrate that the proposed metric attributes higher scores to similar claims, regardless of whether the claim is negated, paraphrased, or expanded, and that the score has a high correlation to human judgment when compared to state-of-the-art metrics.

24. 【2602.08700】Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search

Link: https://arxiv.org/abs/2602.08700

Authors: Clemencia Siro, Zahra Abbasiantaeb, Yifei Yuan, Mohammad Aliannejadi, Maarten de Rijke

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: clarifying questions, increasingly employ clarifying, clarifying, questions, answering clarifying questions

Comments: Accepted at CHIIR 2025

Click to view abstract

Abstract:Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effect of multimodal and text-only clarifying questions in both tasks within a conversational search context from various perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across different expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for clarifying question answering, text-only setups demonstrated better user performance as they provided more comprehensive textual information in the absence of images. These results provide valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be strategically implemented based on the specific search context and user characteristics.

25. 【2602.08698】Challenges in Translating Technical Lectures: Insights from the NPTEL

Link: https://arxiv.org/abs/2602.08698

Authors: Basudha Raje, Sadanand Venkatraman, Nandana TP, Soumyadeepa Das, Polkam Poojitha, M. Vijaykumar, Tanima Bagchi, Hema A. Murthy

Subjects: Computation and Language (cs.CL)

Keywords: emerging translation workflows, existing evaluation frameworks, Machine Translation, specifically Bangla, implications of Machine

Comments

Click to view abstract

Abstract:This study examines the practical applications and methodological implications of Machine Translation in Indian Languages, specifically Bangla, Malayalam, and Telugu, within emerging translation workflows and in relation to existing evaluation frameworks. The choice of languages prioritized in this study is motivated by a triangulation of linguistic diversity, which illustrates the significance of multilingual accommodation of educational technology under NEP 2020. This is further supported by the largest MOOC portal, i.e., NPTEL, which has served as a corpus to facilitate the arguments presented in this paper. The curation of a spontaneous speech corpus that accounts for lucid delivery of technical concepts, while retaining suitable register and lexical choices, is crucial in a diverse country like India. The findings of this study highlight metric-specific sensitivity and the challenges of morphologically rich and semantically compact features when tested against surface overlapping metrics.

26. 【2602.08696】Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis

Link: https://arxiv.org/abs/2602.08696

Authors: Haoshen Wang, Xueli Zhong, Bingbing Lin, Jia Huang, Xingduo Pan, Shengxiang Liang, Nizhuan Wang, Wai Ting Siok

Subjects: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: posing major challenges, exhibits high variability, limited labeled data, automatic speech recognition, assistive speech technologies

Comments

Click to view abstract

Abstract:Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.

27. 【2602.08688】Old wine in old glasses: Comparing computational and qualitative methods in identifying incivility on Persian Twitter during the #MahsaAmini movement

Link: https://arxiv.org/abs/2602.08688

Authors: Hossein Kermani, Fatemeh Oudlajani, Pardis Yarahmadi, Hamideh Mahdi Soltani, Mohammad Makki, Zahra HosseiniKhoo

Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: human qualitative coding, human qualitative, qualitative coding, supervised learning, paper compares

Comments

Click to view abstract

Abstract:This paper compares three approaches to detecting incivility in Persian tweets: human qualitative coding, supervised learning with ParsBERT, and large language models (ChatGPT). Using 47,278 tweets from the #MahsaAmini movement in Iran, we evaluate the accuracy and efficiency of each method. ParsBERT substantially outperforms seven evaluated ChatGPT models in identifying hate speech. We also find that ChatGPT struggles not only with subtle cases but also with explicitly uncivil content, and that prompt language (English vs. Persian) does not meaningfully affect its outputs. The study provides a detailed comparison of these approaches and clarifies their strengths and limitations for analyzing hate speech in a low-resource language context.

28. 【2602.08672】Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Link: https://arxiv.org/abs/2602.08672

Authors: Clemencia Siro, Pourya Aliannejadi, Mohammad Aliannejadi

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: assess system outputs, applying human-defined rubrics, Large language models, natural language generation, Large language

Comments: Accepted at EACL 2026 Findings

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.

29. 【2602.08658】Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models

Link: https://arxiv.org/abs/2602.08658

Authors: Mingzi Cao, Xingwei Tan, Mahmud Akhter, Marco Valentino, Maria Liakata, Xi Wang, Nikolaos Aletras

Subjects: Computation and Language (cs.CL)

Keywords: human logical thinking, logical thinking, human logical, Deduction, improving Large Language

Comments

Click to view abstract

Abstract:Deduction, induction, and abduction are fundamental reasoning paradigms, core for human logical thinking. Although improving Large Language Model (LLM) reasoning has attracted significant research efforts, the extent to which the fundamental paradigms induce generalization has yet to be systematically explored. In this study, we shed light on how the interplay between these core paradigms influences LLMs' reasoning behavior. To this end, we first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms, to abstract from concrete world knowledge. Then, we investigate effective ways for inducing these skills into LLMs. We experiment with a battery of methods including simple fine-tuning, and more complex approaches to increase model depth, or transform a dense model to a mixture-of-experts. We comprehensively evaluate induced models on realistic out-of-domain tasks, that are entirely formulated in natural language and contain real-world knowledge. Our results reveal that our approach yields strong generalizability with substantial performance gains (up to $14.60$) across realistic tasks.

30. 【2602.08632】We Should Separate Memorization from Copyright

Link: https://arxiv.org/abs/2602.08632

Authors: Adi Haviv, Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran

Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: models has introduced, risk factor, foundation models, copyright, Abstract

Comments

Click to view abstract

Abstract:The widespread use of foundation models has introduced a new risk factor: copyright. This issue has sparked an active, ongoing debate among the data-science community and legal scholars, in which claims and results from both sides are often interpreted in different ways, leading to different implications. Our position is that much of the technical literature relies on traditional reconstruction techniques that are not designed for copyright analysis. As a result, memorization and copying have been conflated across both technical and legal communities and in multiple contexts. We argue that memorization, as commonly studied in data science, should not be equated with copying and should not be used as a proxy for copyright infringement. We distinguish technical signals that meaningfully indicate infringement risk from those that instead reflect lawful generalization or high-frequency content. Based on this analysis, we advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards and provides a more principled foundation for research, auditing, and policy.

31. 【2602.08625】Do Multilingual LLMs have specialized language heads?

Link: https://arxiv.org/abs/2602.08625

Authors: Muhammad Naufil

Subjects: Computation and Language (cs.CL)

Keywords: gained significant popularity, Multilingual large language, gained significant, significant popularity, ability to process

Comments

Click to view abstract

Abstract:Multilingual large language models (LLMs) have gained significant popularity for their ability to process and generate text across multiple languages. However, deploying these models in production can be inefficient when only a subset of the supported languages is of interest. Prior research has examined whether machine translation models have language-specific or language-agnostic heads; however, to the best of our knowledge, no such research exists for multilingual LLMs, which are capable of performing diverse tasks beyond translation. This paper explores whether multilingual LLMs have specialized language attention heads for each language, and investigates the possibility of removing language-specific heads for unwanted languages without degrading performance in the targeted languages. Our findings could inform more efficient deployment strategies for multilingual LLMs, enabling reduced model complexity while maintaining high accuracy for targeted languages.

32. 【2602.08607】VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling

Link: https://arxiv.org/abs/2602.08607

Authors: Ziyang Cheng, Yuhao Wang, Heyang Liu, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

Subjects: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Large Language Models, Recent Speech Large, Speech Large Language, Language Models, Large Language

Comments

Click to view abstract

Abstract:Recent Speech Large Language Models (LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling (MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on a limited scale of only 6K hours of speech data, VocalNet-MDM achieves a 3.7$\times$ to 10$\times$ decoding speedup and reduces first-chunk latency by 34% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.

33. 【2602.08600】Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation

Link: https://arxiv.org/abs/2602.08600

Authors: Archchana Sindhujan, Girish A. Koushik, Shenbin Qian, Diptesh Kanojia, Constantin Orăsan

Subjects: Computation and Language (cs.CL)

Keywords: Quality Estimation, large-scale MT evaluation, aims to assess, outputs without relying, making it essential

Comments: Currently this article is under review for Natural Language Processing Journal

Click to view abstract

Abstract:Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA score and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs ($\leq$4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.

34. 【2602.08567】ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems

Link: https://arxiv.org/abs/2602.08567

Authors: Jinnuo Liu, Chuke Liu, Hua Shen

Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: Multi-agent large language, large language model, systems increasingly consist, large language, increasingly consist

Comments: Preprint. Under review. 18 pages, 9 figures

Click to view abstract

Abstract:Multi-agent large language model (LLM) systems increasingly consist of agents that observe and respond to one another's outputs. While value alignment is typically evaluated for isolated models, how value perturbations propagate through agent interactions remains poorly understood. We present ValueFlow, a perturbation-based evaluation framework for measuring and analyzing value drift in multi-agent systems. ValueFlow introduces a 56-value evaluation dataset derived from the Schwartz Value Survey and quantifies agents' value orientations during interaction using an LLM-as-a-judge protocol. Building on this measurement layer, ValueFlow decomposes value drift into agent-level response behavior and system-level structural effects, operationalized by two metrics: beta-susceptibility, which measures an agent's sensitivity to perturbed peer signals, and system susceptibility (SS), which captures how node-level perturbations affect final system outputs. Experiments across multiple model backbones, prompt personas, value dimensions, and network structures show that susceptibility varies widely across values and is strongly shaped by structural topology.

35. 【2602.08561】Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches

Link: https://arxiv.org/abs/2602.08561

Authors: Syed Mehtab Hussain Shah, Frank Hopfgartner, Arnim Bleier

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Reproducing computational research, Reproducing computational, provided data, rerunning the original, Reproducing

Comments: 12 pages, 5 figures. Submitted to ACM conference

Click to view abstract

Abstract:Reproducing computational research is often assumed to be as simple as rerunning the original code with provided data. In practice, missing packages, fragile file paths, version conflicts, or incomplete logic frequently cause analyses to fail, even when materials are shared. This study investigates whether large language models and AI agents can automate the diagnosis and repair of such failures, making computational results easier to reproduce and verify. We evaluate this using a controlled reproducibility testbed built from five fully reproducible R-based social science studies. Realistic failures were injected, ranging from simple issues to complex missing logic, and two automated repair workflows were tested in clean Docker environments. The first workflow is prompt-based, repeatedly querying language models with structured prompts of varying context, while the second uses agent-based systems that inspect files, modify code, and rerun analyses autonomously. Across prompt-based runs, reproduction success ranged from 31-79 percent, with performance strongly influenced by prompt context and error complexity. Complex cases benefited most from additional context. Agent-based workflows performed substantially better, with success rates of 69-96 percent across all complexity levels. These results suggest that automated workflows, especially agent-based systems, can significantly reduce manual effort and improve reproduction success across diverse error types. Unlike prior benchmarks, our testbed isolates post-publication repair under controlled failure modes, allowing direct comparison of prompt-based and agent-based approaches.

36. 【2602.08548】How Do Language Models Understand Tables? A Mechanistic Analysis of Cell Location

Link: https://arxiv.org/abs/2602.08548

Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, linearized two-dimensional structured, tables remain opaque, process linearized two-dimensional

Comments

Click to view abstract

Abstract:While Large Language Models (LLMs) are increasingly deployed for table-related tasks, the internal mechanisms enabling them to process linearized two-dimensional structured tables remain opaque. In this work, we investigate the process of table understanding by dissecting the atomic task of cell location. Through activation patching and complementary interpretability techniques, we delineate the table understanding mechanism into a sequential three-stage pipeline: Semantic Binding, Coordinate Localization, and Information Extraction. We demonstrate that models locate the target cell via an ordinal mechanism that counts discrete delimiters to resolve coordinates. Furthermore, column indices are encoded within a linear subspace that allows for precise steering of model focus through vector arithmetic. Finally, we reveal that models generalize to multi-cell location tasks by multiplexing the identical attention heads identified during atomic location. Our findings provide a comprehensive explanation of table understanding within Transformer architectures.
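
To make the cell-location task concrete, the sketch below resolves a (row, column) coordinate in a linearized table purely by counting delimiters. It is a plain-code analogy for the ordinal behaviour the abstract describes, not a reproduction of the paper's analysis or of any model's internals; the separators, the sample table, and the function name `locate_cell` are all assumptions.

```python
def locate_cell(linearized, row, col, row_sep="\n", col_sep="|"):
    """Return the cell at zero-based (row, col) by counting delimiters."""
    r = c = 0          # running row and column counters
    current = []       # characters of the cell currently being read
    for ch in linearized:
        if ch == row_sep or ch == col_sep:
            # A delimiter closes the current cell; check if it is the target.
            if r == row and c == col:
                return "".join(current).strip()
            if ch == row_sep:
                r, c = r + 1, 0
            else:
                c += 1
            current = []
        else:
            current.append(ch)
    # The last cell has no trailing delimiter.
    if r == row and c == col:
        return "".join(current).strip()
    raise IndexError(f"no cell at row {row}, column {col}")


# Hypothetical table: a header row followed by two data rows.
table = "name|age|city\nAda|36|London\nLin|29|Taipei"
print(locate_cell(table, 1, 2))
```

The counters `r` and `c` play the role of the coordinates; nothing about the cell contents is used to find the target, only the number of separators seen so far.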

37. 【2602.08543】GISA: A Benchmark for General Information-Seeking Assistant

Link: https://arxiv.org/abs/2602.08543

Authors: Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: multi-turn web interactions, large language models, web interactions, advancement of large, large language

Comments

Click to view abstract

Abstract:The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
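
One way structured answer formats can make exact-match scoring deterministic is sketched below. The comparison rules are assumptions chosen for illustration (order-insensitive sets, order-sensitive lists, tables compared as unordered collections of rows); the abstract does not give the benchmark's actual scoring rules, and `normalize` and `exact_match` are hypothetical helpers.

```python
def normalize(value):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(str(value).strip().lower().split())


def canonical_rows(table):
    """Normalize every cell and sort rows so row order does not matter."""
    return sorted(tuple(normalize(cell) for cell in row) for row in table)


def exact_match(pred, gold, fmt):
    """Deterministic exact match for one of four assumed answer formats."""
    if fmt == "item":
        return normalize(pred) == normalize(gold)
    if fmt == "set":
        # Order and duplicates are ignored.
        return {normalize(x) for x in pred} == {normalize(x) for x in gold}
    if fmt == "list":
        # Order matters.
        return [normalize(x) for x in pred] == [normalize(x) for x in gold]
    if fmt == "table":
        return canonical_rows(pred) == canonical_rows(gold)
    raise ValueError(f"unknown answer format: {fmt}")


print(exact_match(["b", "a"], ["a", "b"], "set"))
print(exact_match(["b", "a"], ["a", "b"], "list"))
```

Because every branch reduces to an equality check on normalized values, the same prediction always receives the same score, with no judge model involved.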

38. 【2602.08503】Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Link: https://arxiv.org/abs/2602.08503

Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: vision-language models, solving complex reasoning, complex reasoning problems, essential for solving, solving complex

Comments: 17 pages

Click to view abstract

Abstract:Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.

39. 【2602.08498】Characterizing, Evaluating, and Optimizing Complex Reasoning

Link: https://arxiv.org/abs/2602.08498

Authors: Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng

Subjects: Computation and Language (cs.CL)

Keywords: Large Reasoning Models, Large Reasoning, increasingly rely, Reasoning, reasoning traces

Comments: Code and data are available at [this https URL](https://github.com/zzzhr97/TRM)

Abstract:Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality along macro- and micro-level concerning efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.

40. 【2602.08489】Beyond Correctness: Learning Robust Reasoning via Transfer

Link: https://arxiv.org/abs/2602.08489

Authors: Hyunseok Lee, Soheil Abbasloo, Jihoon Tack, Jinwoo Shin

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Learning with Verifiable, recently strengthened LLM, Reinforcement Learning, Verifiable Rewards, answer correctness leaves

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view, that robust reasoning should remain useful beyond the mind that produced it, and treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via a transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a +3.6 percentage-point gain in Maj@64 compared to RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, providing both more reliable reasoning and significantly better sample efficiency.
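
The transfer reward in this abstract can be pictured with a minimal sketch. It assumes a hypothetical `receiver` callable standing in for the second model and a fixed truncation ratio `keep_ratio`; these names, the truncation rule, and the exact-match check are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Sequence


def transfer_reward(reasoning_steps: Sequence[str],
                    receiver: Callable[[str, str], str],
                    question: str,
                    gold_answer: str,
                    keep_ratio: float = 0.5) -> float:
    """Return 1.0 if a truncated reasoning prefix lets `receiver` reach the gold answer."""
    # Keep only the first part of the reasoning (the "truncation" step).
    cut = max(1, int(len(reasoning_steps) * keep_ratio))
    prefix = "\n".join(reasoning_steps[:cut])
    # A separate model continues from the prefix and returns a final answer.
    predicted = receiver(question, prefix)
    return 1.0 if predicted.strip() == gold_answer.strip() else 0.0


# Toy receiver (hypothetical): it only succeeds when the prefix contains the key step.
def toy_receiver(question: str, prefix: str) -> str:
    return "12" if "3 * 4" in prefix else "unknown"


steps = [
    "We need the area of a 3 by 4 rectangle.",
    "Area = 3 * 4.",
    "So the answer is 12.",
]
question = "What is the area of a 3 by 4 rectangle?"
reward_clear = transfer_reward(steps, toy_receiver, question, "12", keep_ratio=0.7)
reward_vague = transfer_reward(steps, toy_receiver, question, "12", keep_ratio=0.4)
```

Under these assumptions, a prefix that already carries the decisive step earns the reward, while a prefix that stops before it does not.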

41. 【2602.08437】Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

Link: https://arxiv.org/abs/2602.08437

Authors: Ziyan Wang, Longlong Ma

Subjects: Computation and Language (cs.CL)

Keywords: Promise of CHATGPT, False Promise, mere pattern predictors, Large Language Models, Chomsky provocative critique

Abstract:In Chomsky's provocative critique "The False Promise of CHATGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures as humans do, and are therefore unable to distinguish impossible languages. It stands as a representative of a fundamental challenge to the intellectual foundations of AI, for it synthesizes major methodological issues within LLMs and embodies an iconic a priori rationalist perspective. We examine this famous critique both from the perspective of the pre-existing literature in linguistics and psychology and through an experiment probing the capacity of LLMs to learn both possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English. These include reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch's t-test) shows that GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p < .001). On the other hand, LSTM models' performance tallies with Chomsky's argument, suggesting the irreplaceable role of the evolution of the transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLM research.

42. 【2602.08426】Prism: Spectral-Aware Block-Sparse Attention

Link: https://arxiv.org/abs/2602.08426

Authors: Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-context LLM pre-filling, accelerating long-context LLM, LLM pre-filling, long-context LLM, identifying relevant blocks

Abstract:Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $5.1\times$ speedup.
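
The low-pass effect mentioned in this abstract can be illustrated numerically. The sketch below is not the paper's proof and not Prism's calibration. It averages the unit rotations that a RoPE-style frequency applies across one block and reports how much magnitude survives. The base of 10000, head dimension of 128, and block size of 64 are assumed values chosen for illustration.

```python
import cmath


def pooled_gain(theta: float, block_size: int) -> float:
    """Magnitude of the mean of unit rotations e^{i*m*theta} for m = 0..block_size-1."""
    total = sum(cmath.exp(1j * m * theta) for m in range(block_size))
    return abs(total) / block_size


# RoPE-style frequencies theta_j = base^(-2j/d); all three constants are assumptions.
base, d, block = 10000.0, 128, 64
low_freq = base ** (-2 * 60 / d)   # slowly rotating dimension pair
high_freq = base ** (-2 * 0 / d)   # fastest pair: 1 radian per token

gain_low = pooled_gain(low_freq, block)    # stays close to 1: signal survives pooling
gain_high = pooled_gain(high_freq, block)  # close to 0: rotations cancel each other
```

Slowly rotating dimensions keep almost all of their magnitude after mean pooling, while fast-rotating ones largely cancel, which is the kind of destructive interference the abstract describes.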

43. 【2602.08404】TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

Link: https://arxiv.org/abs/2602.08404

Authors: Linye Wei, Zixiang Luo, Pingzhi Tang, Meng Li

Subjects: Computation and Language (cs.CL)

Keywords: Diffusion large language, recently gained significant, gained significant attention, significant attention due, Diffusion large

Abstract:Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at this https URL.

44. 【2602.08382】Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

Link: https://arxiv.org/abs/2602.08382

Authors: Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, Min Zhang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, face significant challenges, including quadratic computational

Comments: 26 pages, 7 figures. Code and models will be released

Abstract:Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.

45. 【2602.08377】Reinforcement Learning with Backtracking Feedback

Link: https://arxiv.org/abs/2602.08377

Authors: Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, introduce Reinforcement Learning, Large Language, Reinforcement Learning, Addressing the critical

Comments: NeurIPS 2025

Abstract:Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). This framework advances upon prior methods, such as BSAFE, by primarily leveraging a Reinforcement Learning (RL) stage where models learn to dynamically correct their own generation errors. Through RL with critic feedback on the model's live outputs, LLMs are trained to identify and recover from their actual, emergent safety violations by emitting an efficient "backtrack by x tokens" signal, then continuing generation autoregressively. This RL process is crucial for instilling resilience against sophisticated adversarial strategies, including middle filling, Greedy Coordinate Gradient (GCG) attacks, and decoding parameter manipulations. To further support the acquisition of this backtracking capability, we also propose an enhanced Supervised Fine-Tuning (SFT) data generation strategy (BSAFE+). This method improves upon previous data creation techniques by injecting violations into coherent, originally safe text, providing more effective initial training for the backtracking mechanism. Comprehensive empirical evaluations demonstrate that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety outcomes while critically preserving foundational model utility.
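
To make the "backtrack by x tokens" idea concrete, here is a minimal decoding-side sketch. The abstract does not specify how the signal is encoded, so the `<backtrack:x>` token format and the function name are hypothetical.

```python
import re
from typing import List

# Hypothetical signal format; the abstract only says "backtrack by x tokens".
BACKTRACK = re.compile(r"^<backtrack:(\d+)>$")


def apply_backtracks(stream: List[str]) -> List[str]:
    """Replay a token stream, deleting the last x kept tokens at each backtrack signal."""
    kept: List[str] = []
    for token in stream:
        match = BACKTRACK.match(token)
        if match:
            x = int(match.group(1))
            # Drop the most recent x tokens (or everything, if fewer were kept).
            del kept[max(0, len(kept) - x):]
        else:
            kept.append(token)
    return kept


stream = ["Sure", ",", "here", "is", "how", "<backtrack:4>",
          "I", "can't", "help", "with", "that", "."]
final_tokens = apply_backtracks(stream)
```

In this toy stream the signal removes the four tokens after "Sure", and generation then continues with the replacement text.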

46. 【2602.08371】ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts

Link: https://arxiv.org/abs/2602.08371

Authors: Hung Quang Tran, Nam Tien Pham, Son T. Luu, Kiet Van Nguyen

Subjects: Computation and Language (cs.CL)

Keywords: harmful content detection, content detection, plays a significant, significant role, prediction and harmful

Comments: Accepted as main paper at EACL 2026

Abstract:Emotion classification plays a significant role in emotion prediction and harmful content detection. Recent advancements in NLP, particularly through large language models (LLMs), have greatly improved outcomes in this field. This study introduces ViGoEmotions -- a Vietnamese emotion corpus comprising 20,664 social media comments in which each comment is classified into 27 fine-grained distinct emotions. To evaluate the quality of the dataset and its impact on emotion classification, eight pre-trained Transformer-based models were evaluated under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis into textual descriptions, and applying ViSoLex, a model-based lexical normalization system. Results show that converting emojis into text often improves the performance of several BERT-based baselines, while preserving emojis yields the best results for ViSoBERT and CafeBERT. In contrast, removing emojis generally leads to lower performance. ViSoBERT achieved the highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%. Strong performance was also observed from CafeBERT and PhoBERT. These findings highlight that while the proposed corpus can support diverse architectures effectively, preprocessing strategies and annotation quality remain key factors influencing downstream performance.

47. 【2602.08369】MemAdapter: Fast Alignment across Agent Memory Paradigms via Generative Subgraph Retrieval

Link: https://arxiv.org/abs/2602.08369

Authors: Xin Zhang, Kailai Yang, Chenyue Li, Hao Li, Qiyu Wei, Jun'ichi Tsujii, Sophia Ananiadou

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Memory, agent memory systems, memory paradigms, agent memory, enabling reasoning

Abstract:Memory mechanism is a core component of LLM-based agents, enabling reasoning and knowledge discovery over long-horizon contexts. Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory) with tightly coupled retrieval methods that hinder cross-paradigm generalization and fusion. In this work, we take a first step toward unifying heterogeneous memory paradigms within a single memory system. We propose MemAdapter, a memory retrieval framework that enables fast alignment across agent memory paradigms. MemAdapter adopts a two-stage training strategy: (1) training a generative subgraph retriever from the unified memory space, and (2) adapting the retriever to unseen memory paradigms by training a lightweight alignment module through contrastive learning. This design improves the flexibility for memory retrieval and substantially reduces alignment cost across paradigms. Comprehensive experiments on three public evaluation benchmarks demonstrate that the generative subgraph retriever consistently outperforms five strong agent memory systems across three memory paradigms and agent model scales. Notably, MemAdapter completes cross-paradigm alignment within 13 minutes on a single GPU, achieving superior performance over original memory retrievers with less than 5% of training compute. Furthermore, MemAdapter enables effective zero-shot fusion across memory paradigms, highlighting its potential as a plug-and-play solution for agent memory systems.

48. 【2602.08367】WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Link: https://arxiv.org/abs/2602.08367

Authors: Zexuan Wang, Chenghao Yang, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen, Wenhao Huang

Subjects: Computation and Language (cs.CL)

Keywords: requires coordinating tightly, coordinating tightly coupled, single decision dictates, autonomous planning requires, planning requires coordinating

Abstract:Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.

49. 【2602.08343】ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection

Link: https://arxiv.org/abs/2602.08343

Authors: Debajyoti Datta, Trishala Neeraj, Bibek Paudel, Vyom Sharma, Subhabrata Mukherjee

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Long-context inference, KV-cache memory, sequence length, inference is constrained, grows linearly

Comments: 18 pages, 5 figures, 18 tables

Abstract:Long-context inference is constrained by KV-cache memory, which grows linearly with sequence length; KV-cache compression therefore hinges on reliably selecting which past tokens to retain. Most geometry-based eviction methods score keys by cosine similarity to a global centroid, but cosine is scale-invariant and can discard magnitude cues that distinguish semantically salient tokens. We propose ManifoldKV, a training-free scorer that ranks tokens by Euclidean distance to the key centroid, capturing both angular and radial deviations. On the RULER benchmark, ManifoldKV achieves 95.7% accuracy at 4K-16K contexts with 20% compression, matching the best geometric baseline while improving robustness in two regimes where cosine scoring fails. First, on multi-key retrieval, ManifoldKV reduces directional collisions, achieving 92.4% vs. KeyDiff's 77.0% (+15.4 points) on 3-key NIAH at 50% compression. Second, to address dilution and performance collapse of global centroids at 64K context, we introduce WindowedManifoldKV, which restores accuracy to 84.3% at 25% compression, a 49-point recovery over global L2 and +3.2 points over KeyDiff. The method requires only 3 lines of code and works across 4 architectures without tuning.
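
The scoring rule in this abstract is simple enough to sketch in plain Python. The version below ranks key vectors by Euclidean distance to their mean and keeps the farthest ones. Retaining the highest-distance tokens is an assumption drawn from the "outlier detection" framing, and the function names are hypothetical rather than the authors' code.

```python
import math
from typing import List, Sequence


def l2_centroid_scores(keys: Sequence[Sequence[float]]) -> List[float]:
    """Score each key vector by its Euclidean distance to the mean key."""
    dim = len(keys[0])
    centroid = [sum(k[i] for k in keys) / len(keys) for i in range(dim)]
    return [math.dist(k, centroid) for k in keys]


def select_tokens(keys: Sequence[Sequence[float]], keep: int) -> List[int]:
    """Indices of the `keep` highest-scoring tokens, returned in original order."""
    scores = l2_centroid_scores(keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)


# Token 3 points in the same direction as token 0 but has a much larger norm,
# so a purely angular score cannot tell them apart while an L2 score can.
keys = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.1], [5.0, 0.0], [1.0, -0.1]]
kept = select_tokens(keys, keep=2)
```

The toy data is chosen so that the retained set includes the large-magnitude token, which is the magnitude cue the abstract says cosine scoring can discard.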

50. 【2602.08336】UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

Link: https://arxiv.org/abs/2602.08336

Authors: Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: implicit visual requirements, recent unified multimodal, models increasingly adopt, multimodal models increasingly, increasingly adopt

Comments: Project page: [this https URL](https://ureason.github.io)

Abstract:To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.

51. 【2602.08332】Latent Reasoning with Supervised Thinking States

Link: https://arxiv.org/abs/2602.08332

Authors: Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig, Lior Shani, Idan Szpektor

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, enables Large Language, incurs significant inference, significant inference costs, inference costs due

Abstract:Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning while the input is being processed. Specifically, Thinking States generates sequences of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. First, it captures the recurrent nature of CoT, but with the thought tokens generated as the input is being processed. Second, since the thoughts are represented as tokens, they can be learned from natural language supervision using teacher forcing, which is parallelizable. Empirically, Thinking States outperforms other latent reasoning methods on multiple reasoning tasks, narrowing the gap to CoT on math problems and matching its performance on 2-Hop QA with improved latency. On state-tracking tasks, we show that Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.

52. 【2602.08322】An Attention-over-Attention Generative Model for Joint Multiple Intent Detection and Slot Filling

Link: https://arxiv.org/abs/2602.08322

Authors: Wei Zhu

Subjects: Computation and Language (cs.CL)

Keywords: spoken language understanding, task-oriented dialogue systems, spoken language, language understanding, critical component

Abstract:In task-oriented dialogue systems, spoken language understanding (SLU) is a critical component, which consists of two sub-tasks, intent detection and slot filling. Most existing methods focus on the single-intent SLU, where each utterance only has one intent. However, in real-world scenarios users usually express multiple intents in an utterance, which poses a challenge for existing dialogue systems and datasets. In this paper, we propose a generative framework to simultaneously address multiple intent detection and slot filling. In particular, an attention-over-attention decoder is proposed to handle the variable number of intents and the interference between the two sub-tasks by incorporating an inductive bias into the process of multi-task learning. Besides, we construct two new multi-intent SLU datasets based on single-intent utterances by taking advantage of the next sentence prediction (NSP) head of the BERT model. Experimental results demonstrate that our proposed attention-over-attention generative model achieves state-of-the-art performance on two public datasets, MixATIS and MixSNIPS, and our constructed datasets.

53. 【2602.08321】Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Link: https://arxiv.org/abs/2602.08321

Authors: Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng

Subjects: Computation and Language (cs.CL)

Keywords: inherently unreliable supervision, Solving open-ended science, large language models, questions remains challenging, Solving open-ended

Abstract:Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT - RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained using the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.

54. 【2602.08305】JUSTICE: Judicial Unified Synthesis Through Intermediate Conclusion Emulation for Automated Judgment Document Generation

Link: https://arxiv.org/abs/2602.08305

Authors: Binglin Wu, Yingyi Zhang, Xianneng Li

Subjects: Computation and Language (cs.CL)

Keywords: Automated judgment document, Automated judgment, judgment document generation, significant yet challenging

Abstract:Automated judgment document generation is a significant yet challenging legal AI task. As the conclusive written instrument issued by a court, a judgment document embodies complex legal reasoning. However, existing methods often oversimplify this complex process, particularly by omitting the "Pre-Judge" phase, a crucial step where human judges form a preliminary conclusion. This omission leads to two core challenges: 1) the ineffective acquisition of foundational judicial elements, and 2) the inadequate modeling of the Pre-Judge process, which collectively undermine the final document's legal soundness. To address these challenges, we propose Judicial Unified Synthesis Through Intermediate Conclusion Emulation (JUSTICE), a novel framework that emulates the "Search $\rightarrow$ Pre-Judge $\rightarrow$ Write" cognitive workflow of human judges. Specifically, it introduces the Pre-Judge stage through three dedicated components: Referential Judicial Element Retriever (RJER), Intermediate Conclusion Emulator (ICE), and Judicial Unified Synthesizer (JUS). RJER first retrieves legal articles and a precedent case to establish a referential foundation. ICE then operationalizes the Pre-Judge phase by generating a verifiable intermediate conclusion. Finally, JUS synthesizes these inputs to craft the final judgment. Experiments on both an in-domain legal benchmark and an out-of-distribution dataset show that JUSTICE significantly outperforms strong baselines, with substantial gains in legal accuracy, including a 4.6% improvement in prison term prediction. Our findings underscore the importance of explicitly modeling the Pre-Judge process to enhance the legal coherence and accuracy of generated judgment documents.

55. 【2602.08294】When Does Context Help? Error Dynamics of Contextual Information in Large Language Models

Link: https://arxiv.org/abs/2602.08294

Authors: Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

Subjects: Computation and Language (cs.CL)

Keywords: large language models, role remains poorly, remains poorly understood, theoretical role remains, retrieved knowledge

Abstract:Contextual information at inference time, such as demonstrations, retrieved knowledge, or interaction history, can substantially improve large language models (LLMs) without parameter updates, yet its theoretical role remains poorly understood beyond specific settings such as in-context learning (ICL). We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in Transformer-based LLMs. Our analysis characterizes contextual influence through output error dynamics. In a single-layer Transformer, we prove that the context-conditioned error vector decomposes additively into the baseline error vector and a contextual correction vector. This yields necessary geometric conditions for error reduction: the contextual correction must align with the negative baseline error and satisfy a norm constraint. We further show that the contextual correction norm admits an explicit upper bound determined by context-query relevance and complementarity. These results extend to multi-context and multi-layer Transformers. Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy that improves performance by $0.6\%$.
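
For intuition, an additive decomposition of this kind already implies conditions of the stated shape when error is measured with the Euclidean norm. The symbols below are notation chosen here and may differ from the paper's: $e$ is the baseline error vector, $c$ the contextual correction vector, and $\theta$ the angle between $c$ and $-e$.

```latex
\begin{aligned}
e_{\mathrm{ctx}} &= e + c, \\
\|e + c\|^2 &= \|e\|^2 + 2\langle e, c\rangle + \|c\|^2, \\
\|e + c\| < \|e\|
  &\iff \langle c, -e\rangle > \tfrac{1}{2}\,\|c\|^2
  \iff 0 < \|c\| < 2\,\|e\|\cos\theta .
\end{aligned}
```

Read this way, the correction must point toward $-e$ (so that $\cos\theta > 0$) and must not be longer than twice the projection of the baseline error onto its direction. This is only the elementary Euclidean case, not the paper's full result.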

56. 【2602.08289】Knowledge Augmented Entity and Relation Extraction for Legal Documents with Hypergraph Neural Network

Link: https://arxiv.org/abs/2602.08289

Authors: Binglin Wu, Xianneng Li

Subjects: Computation and Language (cs.CL)

Keywords: Chinese judicial institutions, digitization in Chinese, electronic legal document, Chinese judicial, continuous progress

Abstract:With the continuous progress of digitization in Chinese judicial institutions, a substantial amount of electronic legal document information has been accumulated. To unlock its potential value, entity and relation extraction for legal documents has emerged as a crucial task. However, existing methods often lack domain-specific knowledge and fail to account for the unique characteristics of the judicial domain. In this paper, we propose an entity and relation extraction algorithm based on hypergraph neural network (Legal-KAHRE) for drug-related judgment documents. Firstly, we design a candidate span generator based on neighbor-oriented packing strategy and biaffine mechanism, which identifies spans likely to contain entities. Secondly, we construct a legal dictionary with judicial domain knowledge and integrate it into text encoding representation using multi-head attention. Additionally, we incorporate domain-specific cases like joint crimes and combined punishment for multiple crimes into the hypergraph structure design. Finally, we employ a hypergraph neural network for higher-order inference via message passing. Experimental results on the CAIL2022 information extraction dataset demonstrate that our method significantly outperforms existing baseline models.

57. 【2602.08281】New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Link: https://arxiv.org/abs/2602.08281

Authors: Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang, Xiaoye Qu, Yu Cheng

Categories: Computation and Language (cs.CL)

Keywords: endows Large Language, Large Language Models, elicits latent traces, latent traces remains, Reinforcement Learning

Comments: 15 pages

Abstract:Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
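
The "exponential decay of success rates" mentioned in this abstract can be illustrated with a few lines of arithmetic. The sketch below is not the paper's code: it assumes every step of a chain succeeds independently with the same probability, and the function name `chain_success` is hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent,
    identically reliable steps (an illustrative assumption)."""
    return p_step ** n_steps


# Compare how per-step reliability compounds over longer chains.
for p in (0.80, 0.95, 0.99):
    rates = [round(chain_success(p, n), 3) for n in (1, 5, 10, 20)]
    print(f"p_step={p:.2f} -> success over 1/5/10/20 steps: {rates}")
```

Under this simplification, a small gain in per-step probability yields a large gain on long chains, which is the intuition behind the "sharpening atomic step probabilities" hypothesis.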

58. 【2602.08274】Language Modeling and Understanding Through Paraphrase Generation and Detection

Link: https://arxiv.org/abs/2602.08274

Authors: Jan Philip Wahle

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: share knowledge, innovation across generations, pass on strategies, strategies for survival, survival and innovation

Comments: PhD dissertation, University of Göttingen, Germany, 2025. 182 pages

Abstract:Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process is not just the ability to communicate but also the remarkable flexibility in how we can express ourselves. We can express the same thoughts in virtually infinite ways using different words and structures - this ability to rephrase and reformulate expressions is known as paraphrase. Modeling paraphrases is a keystone to meaning in computational language models; being able to construct different variations of texts that convey the same meaning or not shows strong abilities of semantic understanding. If computational language models are to represent meaning, they must understand and control the different aspects that construct the same meaning as opposed to different meanings at a fine granularity. Yet most existing approaches reduce paraphrasing to a binary decision between two texts or to producing a single rewrite of a source, obscuring which linguistic factors are responsible for meaning preservation. In this thesis, I propose that decomposing paraphrases into their constituent linguistic aspects (paraphrase types) offers a more fine-grained and cognitively grounded view of semantic equivalence. I show that even advanced machine learning models struggle with this task. Yet, when explicitly trained on paraphrase types, models achieve stronger performance on related paraphrase tasks and downstream applications. For example, in plagiarism detection, language models trained on paraphrase types surpass human baselines: 89.6% accuracy compared to 78.4% for plagiarism cases from Wikipedia, and 66.5% compared to 55.7% for plagiarism of scientific papers from arXiv. In identifying duplicate questions on Quora, models trained with paraphrase types improve over models trained on binary pairs. Furthermore, I demonstrate that...

59. 【2602.08252】Language Predicts Identity Fusion Across Cultures and Reveals Divergent Pathways to Violence

Link: https://arxiv.org/abs/2602.08252

Authors: Devin R. Wright, Justin E. Lane, F. LeRon Shults

Categories: Computation and Language (cs.CL)

Keywords: understanding the psychological, increasingly important, light of increasing, increasing polarization, polarization and political

Comments: Initial submitted version

Abstract:In light of increasing polarization and political violence, understanding the psychological roots of extremism is increasingly important. Prior research shows that identity fusion predicts willingness to engage in extreme acts. We evaluate the Cognitive Linguistic Identity Fusion Score, a method that uses cognitive linguistic patterns, LLMs, and implicit metaphor to measure fusion from language. Across datasets from the United Kingdom and Singapore, this approach outperforms existing methods in predicting validated fusion scores. Applied to extremist manifestos, two distinct high-fusion pathways to violence emerge: ideologues tend to frame themselves in terms of group, forming kinship bonds; whereas grievance-driven individuals frame the group in terms of their personal identity. These results refine theories of identity fusion and provide a scalable tool aiding fusion research and extremism detection.

60. 【2602.08238】On convexity and efficiency in semantic systems

Link: https://arxiv.org/abs/2602.08238

Authors: Nathaniel Imel, Noga Zaslavsky

Categories: Computation and Language (cs.CL)

Keywords: widely held characterizations, human semantic category, form convex partitions, conceptual spaces, efficient for communication

Abstract:There are two widely held characterizations of human semantic category systems: (1) they form convex partitions of conceptual spaces, and (2) they are efficient for communication. While prior work observed that convexity and efficiency co-occur in color naming, the analytical relation between them and why they co-occur have not been well understood. We address this gap by combining analytical and empirical analyses that build on the Information Bottleneck (IB) framework for semantic efficiency. First, we show that convexity and efficiency are distinct in the sense that neither entails the other: there are convex systems which are inefficient, and optimally-efficient systems that are non-convex. Crucially, however, the IB-optimal systems are mostly convex in the domain of color naming, explaining the main empirical basis for the convexity approach. Second, we show that efficiency is a stronger predictor for discriminating attested color naming systems from hypothetical variants, with convexity adding negligible improvement on top of that. Finally, we discuss a range of empirical phenomena that convexity cannot account for but efficiency can. Taken together, our work suggests that while convexity and efficiency can yield similar structural observations, they are fundamentally distinct, with efficiency providing a more comprehensive account of semantic typology.

61. 【2602.08237】Document Reconstruction Unlocks Scalable Long-Context RLVR

Link: https://arxiv.org/abs/2602.08237

Authors: Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Verifiable Rewards, Language Models, Learning with Verifiable

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.

62. 【2602.08236】When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Link: https://arxiv.org/abs/2602.08236

Authors: Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Multimodal Large, Large Language Models, correct answers depend, progress in Multimodal

Comments: The first two authors contributed equally. Project page: [this https URL](https://adaptive-visual-tts.github.io/)

Abstract:Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

63. 【2602.08235】When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

Link: https://arxiv.org/abs/2602.08235

Authors: Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: hold significant potential, automate increasingly complex, demonstrate unsafe unintended, hold significant, complex OS workflows

Comments: Project Homepage: [this https URL](https://osu-nlp-group.github.io/AutoElicit/)

Abstract:Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

64. 【2602.08221】CoRect: Context-Aware Logit Contrast for Hidden State Rectification to Resolve Knowledge Conflicts

Link: https://arxiv.org/abs/2602.08221

Authors: Xuhua Ma, Richong Zhang, Zhijie Nie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: overrides retrieved evidence, knowledge overrides retrieved, Retrieval-Augmented Generation, model-internal parametric knowledge, parametric knowledge overrides

Abstract:Retrieval-Augmented Generation (RAG) often struggles with knowledge conflicts, where model-internal parametric knowledge overrides retrieved evidence, leading to unfaithful outputs. Existing approaches are often limited, relying either on superficial decoding adjustments or weight editing that necessitates ground-truth targets. Through layer-wise analysis, we attribute this failure to a parametric suppression phenomenon: specifically, in deep layers, certain FFN layers overwrite context-sensitive representations with memorized priors. To address this, we propose CoRect (Context-Aware Logit Contrast for Hidden State Rectification). By contrasting logits from contextualized and non-contextualized forward passes, CoRect identifies layers that exhibit high parametric bias without requiring ground-truth labels. It then rectifies the hidden states to preserve evidence-grounded information. Across question answering (QA) and summarization benchmarks, CoRect consistently improves faithfulness and reduces hallucinations compared to strong baselines.

65. 【2602.08220】Pretraining with Token-Level Adaptive Latent Chain-of-Thought

Link: https://arxiv.org/abs/2602.08220

Authors: Boyi Zeng, Yiqin Hao, He Li, Shixiang Song, Feichen Song, Zitong Wang, Siyuan Huang, Yi Xu, ZiWei He, Xinbing Wang, Zhouhan Lin

Categories: Computation and Language (cs.CL)

Keywords: rising communication costs, Scaling large language, limited high-quality corpora, Scaling large, Adaptive Latent CoT

Abstract:Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token -- allocating longer trajectories to difficult tokens and shorter (or even zero) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.

66. 【2602.08213】DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning

Link: https://arxiv.org/abs/2602.08213

Authors: Haoran Liu, Zheni Zeng, Yukun Yan, Yuxuan Chen, Yunduo Xiao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)

Keywords: chemical domain, fundamental task, task in chemical, Molecule generation

Abstract:Molecule generation and optimization is a fundamental task in chemical domain. The rapid development of intelligent tools, especially large language models (LLMs) with powerful knowledge reserves and interactive capabilities, has provided new paradigms for it. Nevertheless, the intrinsic challenge for LLMs lies in the complex implicit relationship between molecular structure and pharmacological properties and the lack of corresponding labeled data. To bridge this gap, we propose DrugR, an LLM-based method that introduces explicit, step-by-step pharmacological reasoning into the optimization process. Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning. This framework enables DrugR to effectively improve key ADMET properties while preserving the original molecule's core efficacy. Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity. Importantly, its explicit reasoning process provides clear, interpretable rationales for each optimization step, yielding actionable design insights and advancing toward automated, knowledge-driven scientific discovery. Our code and model checkpoints are open-sourced to foster future research.

67. 【2602.08208】LLMs and people both learn to form conventions -- just not with each other

Link: https://arxiv.org/abs/2602.08208

Authors: Cameron R. Jones, Agnese Lombardi, Kyle Mahowald, Benjamin K. Bergen

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: adopting shared conventions, ease communication, multimodal communication game, adopting shared, Humans align

Comments: 10 pages, 4 figures

Abstract:Humans align to one another in conversation -- adopting shared conventions that ease communication. We test whether LLMs form the same kinds of conventions in a multimodal communication game. Both humans and LLMs display evidence of convention-formation (increasing the accuracy and consistency of their turns while decreasing their length) when communicating in same-type dyads (humans with humans, AI with AI). However, heterogeneous human-AI pairs fail -- suggesting differences in communicative tendencies. In Experiment 2, we ask whether LLMs can be induced to behave more like human conversants, by prompting them to produce superficially humanlike behavior. While the length of their messages matches that of human pairs, accuracy and lexical overlap in human-LLM pairs continue to lag behind those of both human-human and AI-AI pairs. These results suggest that conversational alignment requires more than just the ability to mimic previous interactions, but also shared interpretative biases toward the meanings that are conveyed.

68. 【2602.08194】Dreaming in Code for Curriculum Learning in Open-Ended Worlds

Link: https://arxiv.org/abs/2602.08194

Authors: Konstantinos Mitsides, Maxence Faldor, Antoine Cully

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: learning frames intelligence, frames intelligence, intelligence as emerging, emerging from continual, continual interaction

Comments: 11 pages (main text), 90 pages total. Project page: [this https URL](https://konstantinosmitsides.github.io/dreaming-in-code)

Abstract:Open-ended learning frames intelligence as emerging from continual interaction with an ever-expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open-ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, "dreaming" takes the form of materializing code-level variations of the world. We instantiate DiCode in Craftax, a challenging open-ended benchmark characterized by rich mechanics and long-horizon progression. Empirically, DiCode enables agents to acquire long-horizon skills, achieving a $16\%$ improvement in mean return over the strongest baseline and non-zero success on late-game combat tasks where prior methods fail. Our results suggest that code-level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open-ended worlds. Project page and source code are available at this https URL and this https URL.

69. 【2602.08169】Spherical Steering: Geometry-Aware Activation Rotation for Language Models

Link: https://arxiv.org/abs/2602.08169

Authors: Zejia You, Chunyuan Deng, Hanjie Chen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: controlling language models, cost of retraining, promising paradigm, paradigm for controlling, controlling language

Comments: The code is at: [this https URL](https://github.com/chili-lab/Spherical-Steering)

Abstract:Inference-time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open-ended generation capabilities. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model's general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control.

70. 【2602.08162】NLP for Local Governance Meeting Records: A Focus Article on Tasks, Datasets, Metrics and Benchmark

Link: https://arxiv.org/abs/2602.08162

Authors: Ricardo Campos, José Pedro Evans, José Miguel Isidro, Miguel Marques, Luís Filipe Cunha, Alípio Jorge, Sérgio Nunes, Nuno Guimarães

Categories: Computation and Language (cs.CL)

Keywords: procedural actions unfold, Local governance meeting, documenting how proposals, minutes or transcripts, form of minutes

Abstract:Local governance meeting records are official documents, in the form of minutes or transcripts, documenting how proposals, discussions, and procedural actions unfold during institutional meetings. While generally structured, these documents are often dense, bureaucratic, and highly heterogeneous across municipalities, exhibiting significant variation in language, terminology, structure, and overall organization. This heterogeneity makes them difficult for non-experts to interpret and challenging for intelligent automated systems to process, limiting public transparency and civic engagement. To address these challenges, computational methods can be employed to structure and interpret such complex documents. In particular, Natural Language Processing (NLP) offers well-established methods that can enhance the accessibility and interpretability of governmental records. In this focus article, we review foundational NLP tasks that support the structuring of local governance meeting documents. Specifically, we review three core tasks: document segmentation, domain-specific entity extraction and automatic text summarization, which are essential for navigating lengthy deliberations, identifying political actors and personal information, and generating concise representations of complex decision-making processes. In reviewing these tasks, we discuss methodological approaches, evaluation metrics, and publicly available resources, while highlighting domain-specific challenges such as data scarcity, privacy constraints, and source variability. By synthesizing existing work across these foundational tasks, this article provides a structured overview of how NLP can enhance the structuring and accessibility of local governance meeting records.

71. 【2602.08159】The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Link: https://arxiv.org/abs/2602.08159

Authors: Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Australia is Sydney, language model asserts, capital of Australia, AUC, language model

Abstract:When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
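
The claim that class separation is "a mean shift" suggests a detector that needs no trained probe: compare an activation's distance to the two class centroids. The following is a rough sketch under simplifying assumptions, not the paper's pipeline: the activations are made-up 2-D toy vectors, the projection onto a low-dimensional subspace is skipped, and the names `centroid` and `correctness_score` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def correctness_score(x, c_correct, c_incorrect):
    """Positive when x lies closer to the 'correct' centroid than to the
    'incorrect' one (Euclidean distance)."""
    return math.dist(x, c_incorrect) - math.dist(x, c_correct)


# Toy stand-ins for hidden activations of correct / incorrect statements.
correct_acts = [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
incorrect_acts = [[-1.0, -0.8], [-0.9, -1.2], [-1.1, -1.0]]

c_pos = centroid(correct_acts)
c_neg = centroid(incorrect_acts)
print("score for a new activation:", correctness_score([0.7, 0.6], c_pos, c_neg))
```

Because only two class means are estimated, a detector of this kind can be fit from a handful of labeled examples, which is consistent with the few-shot result reported in the abstract.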

72. 【2602.08149】DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries

Link: https://arxiv.org/abs/2602.08149

Authors: Sahana Ramnath, Nima Chitsazan, Mingyang Zhou, Chia-Hsuan Lee, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Sangwoo Cho, Xiang Ren, Genta Indra Winata, Akshaj Kumar Veldanda

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: revise key points, key points discussed, automatically generated summaries, communication for humans, product users

Abstract:Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revise key points discussed in a meeting, to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) shift in structure, from multiple speakers discussing information in a scattered fashion across several turns, to a summary's sentences, and (ii) shift in narration viewpoint, from speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL that focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL that focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors, and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance on this task. Code and inference dataset coming soon.

73. 【2602.08145】Reliable and Responsible Foundation Models: A Comprehensive Survey

Link: https://arxiv.org/abs/2602.08145

Authors: Xinyu Yang, Junlin Han, Rishi Bommasani, Jinqi Luo, Wenjie Qu, Wangchunshu Zhou, Adel Bibi, Xiyao Wang, Jaehong Yoon, Elias Stengel-Eskin, Shengbang Tong, Lingfeng Shen, Rafael Rafailov, Runjia Li, Zhaoyang Wang, Yiyang Zhou, Chenhang Cui, Yu Wang, Wenhao Zheng, Huichi Zhou, Jindong Gu, Zhaorun Chen, Peng Xia, Tony Lee, Thomas Zollo, Vikash Sehwag, Jixuan Leng, Jiuhai Chen, Yuxin Wen, Huan Zhang, Zhun Deng, Linjun Zhang, Pavel Izmailov, Pang Wei Koh, Yulia Tsvetkov, Andrew Wilson, Jiaheng Zhang, James Zou, Cihang Xie, Hao Wang, Philip Torr, Julian McAuley, David Alvarez-Melis, Florian Tramèr, Kaidi Xu, Suman Jana, Chris Callison-Burch, Rene Vidal, Filippos Kokkinos, Mohit Bansal, Beidi Chen, Huaxiu Yao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: Multimodal Large Language, Large Language Models, Image Generative Models, Video Generative Models, including Large Language

Comments: TMLR camera-ready version

Abstract:Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, science, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.

74. 【2602.08128】Online Bayesian Imbalanced Learning with Bregman-Calibrated Deep Networks

Link: https://arxiv.org/abs/2602.08128

Authors: Zahir Alsulaimawi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: standard classifiers exhibit, Online Bayesian Imbalanced, classifiers exhibit severe, minority classes, Bayesian Imbalanced Learning

Abstract:Class imbalance remains a fundamental challenge in machine learning, where standard classifiers exhibit severe performance degradation in minority classes. Although existing approaches address imbalance through resampling or cost-sensitive learning during training, they require retraining or access to labeled target data when class distributions shift at deployment time, a common occurrence in real-world applications such as fraud detection, medical diagnosis, and anomaly detection. We present Online Bayesian Imbalanced Learning (OBIL), a principled framework that decouples likelihood-ratio estimation from class-prior assumptions, enabling real-time adaptation to distribution shifts without model retraining. Our approach builds on the established connection between Bregman divergences and proper scoring rules to show that deep networks trained with such losses produce posterior probability estimates from which prior-invariant likelihood ratios can be extracted. We prove that these likelihood-ratio estimates remain valid under arbitrary changes in class priors and cost structures, requiring only a threshold adjustment for optimal Bayes decisions. We derive finite-sample regret bounds demonstrating that OBIL achieves $O(\sqrt{T \log T})$ regret against an oracle with perfect prior knowledge. Extensive experiments on benchmark datasets and medical diagnosis benchmarks under simulated deployment shifts demonstrate that OBIL maintains robust performance under severe distribution shifts, outperforming state-of-the-art methods in F1 Score when test distributions deviate significantly from the training conditions.
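
The idea of extracting a prior-invariant likelihood ratio from a posterior and then only adjusting a threshold can be shown with the textbook prior-shift correction for a binary classifier. This is a sketch of that standard identity, not OBIL itself; the function names and the example numbers are illustrative assumptions, and posteriors and priors are assumed to lie strictly between 0 and 1.

```python
def likelihood_ratio(posterior: float, train_prior: float) -> float:
    """Recover p(x | y=1) / p(x | y=0) from a calibrated posterior P(y=1 | x)
    by dividing the posterior odds by the training prior odds."""
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = train_prior / (1.0 - train_prior)
    return posterior_odds / prior_odds


def adapted_posterior(posterior: float, train_prior: float, new_prior: float) -> float:
    """Re-express the posterior under a new class prior, without retraining."""
    odds = likelihood_ratio(posterior, train_prior) * new_prior / (1.0 - new_prior)
    return odds / (1.0 + odds)


def bayes_threshold(new_prior: float, cost_fp: float = 1.0, cost_fn: float = 1.0) -> float:
    """Predict the positive class when the likelihood ratio exceeds this value."""
    return (cost_fp * (1.0 - new_prior)) / (cost_fn * new_prior)


# A model trained on balanced data is deployed where positives are 10% of cases.
p = 0.80
print("likelihood ratio:", likelihood_ratio(p, train_prior=0.5))
print("posterior at 10% prior:", adapted_posterior(p, train_prior=0.5, new_prior=0.1))
print("decision threshold on the ratio:", bayes_threshold(new_prior=0.1))
```

The likelihood ratio is computed once from the trained model's output; only the threshold (or the re-weighted posterior) changes when the deployment prior or the misclassification costs change.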

75. 【2602.08124】Gender and Race Bias in Consumer Product Recommendations by Large Language Models

Link: https://arxiv.org/abs/2602.08124

Authors: Ke Xu, Shera Potka, Alex Thomo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, biases remains underexplored, generating consumer product

Comments: Accepted at the 39th International Conference on Advanced Information Networking and Applications (AINA 2025)

Click to view abstract

Abstract:Large Language Models are increasingly employed in generating consumer product recommendations, yet their potential for embedding and amplifying gender and race biases remains underexplored. This paper serves as one of the first attempts to examine these biases within LLM-generated recommendations. We leverage prompt engineering to elicit product suggestions from LLMs for various race and gender groups and employ three analytical methods (Marked Words, Support Vector Machines, and Jensen-Shannon Divergence) to identify and quantify biases. Our findings reveal significant disparities in the recommendations for demographic groups, underscoring the need for more equitable LLM recommendation systems.
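
Of the three analytical methods named above, Jensen-Shannon Divergence is the simplest to illustrate. The following is a minimal sketch of how two groups' recommendation texts might be compared; it is an assumption about the general approach, not the paper's pipeline. The helper names (`word_distribution`, `js_divergence`) and the whitespace tokenization are hypothetical, and `word_distribution` expects at least one non-empty text.

```python
import math
from collections import Counter


def word_distribution(texts):
    """Turn a list of texts into a unigram probability distribution."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in bits; q must be positive wherever p is positive."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    mixture = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

With base-2 logarithms the score is bounded between 0 and 1, which makes values comparable across pairs of demographic groups.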

76. 【2602.08100】Emergent Search and Backtracking in Latent Reasoning Models

Link: https://arxiv.org/abs/2602.08100

Authors: Jasmine Cui, Charles Ye

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: language model, Abstract, model, latent, space

Comments

Click to view abstract

Abstract:What happens when a language model thinks without words? Standard reasoning LLMs verbalize intermediate steps as chain-of-thought; latent reasoning transformers (LRTs) instead perform deliberation entirely in continuous hidden space. We investigate an LRT, decoding the model's evolving beliefs at every step on a multiple-choice QA benchmark. We find that the model spontaneously learns a structured search process in latent space. Deliberation follows a consistent trajectory: an exploration phase where probability mass spreads across candidates, tentative commitment to a frontrunner, and either convergence or backtracking. Backtracking is prevalent (32% of instances), beneficial (34% accuracy gain over non-backtracking instances), and predominantly directed away from the semantically closest distractor toward the correct answer. The search is adaptive: replacing distractors with implausible alternatives shortens exploration by 54%. Latent reasoning models achieve in activation space what chain-of-thought achieves through words: the ability to be wrong, notice, and recover.

77. 【2602.08064】SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Link: https://arxiv.org/abs/2602.08064

Authors: Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Modern Transformers predominantly, Transformers predominantly adopt, Modern Transformers, Transformers predominantly, foregoing the superior

Comments

Click to view abstract

Abstract:Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at this https URL.

78. 【2602.08048】TDGNet: Hallucination Detection in Diffusion Language Models via Temporal Dynamic Graphs

Link: https://arxiv.org/abs/2602.08048

Authors: Arshia Hemmat, Philip Torr, Yongqiang Chen, Junchi Yu

Categories: Computation and Language (cs.CL)

Keywords: D-LLMs remains underexplored, offer parallel denoising, Diffusion language models, offer parallel, bidirectional context

Comments

Click to view abstract

Abstract:Diffusion language models (D-LLMs) offer parallel denoising and bidirectional context, but hallucination detection for D-LLMs remains underexplored. Prior detectors developed for auto-regressive LLMs typically rely on single-pass cues and do not directly transfer to diffusion generation, where factuality evidence is distributed across the denoising trajectory and may appear, drift, or be self-corrected over time. We introduce TDGNet, a temporal dynamic graph framework that formulates hallucination detection as learning over evolving token-level attention graphs. At each denoising step, we sparsify the attention graph and update per-token memories via message passing, then apply temporal attention to aggregate trajectory-wide evidence for final prediction. Experiments on LLaDA-8B and Dream-7B across QA benchmarks show consistent AUROC improvements over output-based, latent-based, and static-graph baselines, with single-pass inference and modest overhead. These results highlight the importance of temporal reasoning on attention graphs for robust hallucination detection in diffusion language models.

79. 【2602.08041】Implicit Strategic Optimization: Rethinking Long-Horizon Decision-Making in Adversarial Poker Environments

Link: https://arxiv.org/abs/2602.08041

Authors: Boyang Xia, Weiyou Tian, Qingnan Ren, Jiaqi Huang, Jie Xiao, Shuo Lu, Kai Wang, Lynn Ai, Eric Yang, Bill Shi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Training large language, large language model, Training large, large language, adversarial games

Comments

Click to view abstract

Abstract:Training large language model (LLM) agents for adversarial games is often driven by episodic objectives such as win rate. In long-horizon settings, however, payoffs are shaped by latent strategic externalities that evolve over time, so myopic optimization and variation-based regret analyses can become vacuous even when the dynamics are predictable. To solve this problem, we introduce Implicit Strategic Optimization (ISO), a prediction-aware framework in which each agent forecasts the current strategic context and uses it to update its policy online. ISO combines a Strategic Reward Model (SRM) that estimates the long-run strategic value of actions with iso-grpo, a context-conditioned optimistic learning rule. We prove sublinear contextual regret and equilibrium convergence guarantees whose dominant terms scale with the number of context mispredictions; when prediction errors are bounded, our bounds recover the static-game rates obtained when strategic externalities are known. Experiments in 6-player No-Limit Texas Hold'em and competitive Pokemon show consistent improvements in long-term return over strong LLM and RL baselines, and graceful degradation under controlled prediction noise.

80. 【2602.08031】Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection

Link: https://arxiv.org/abs/2602.08031

Authors: Chenwang Wu, Yiu-ming Cheung, Shuhai Zhang, Bo Han, Defu Lian

Categories: Computation and Language (cs.CL)

Keywords: offer great convenience, machine-generated texts, offer great, great convenience, disinformation and phishing

Comments

Click to view abstract

Abstract:While machine-generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. To address this, we theoretically and empirically reveal two relationships of context detection scores that may aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov-informed score calibration strategy that models these relationships using Markov random fields, and implements it as a lightweight component via a mean-field approximation, allowing our method to be seamlessly integrated into existing detectors. Extensive experiments in various real-world scenarios, such as cross-LLM and paraphrasing attacks, demonstrate significant gains over baselines with negligible computational overhead. The code is available at this https URL.

81. 【2602.08030】Free(): Learning to Forget in Malloc-Only Reasoning Models

Link: https://arxiv.org/abs/2602.08030

Authors: Yilun Zheng, Dongyang Ma, Tian Liang, Jiahao Xu, Xinting Huang, Lijie Chen, Haitao Mi, Yan Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: scaling test-time compute, excessive thinking tokens, models enhance problem-solving, test-time compute, critical paradox

Comments

Click to view abstract

Abstract:Reasoning models enhance problem-solving by scaling test-time compute, yet they face a critical paradox: excessive thinking tokens often degrade performance rather than improve it. We attribute this to a fundamental architectural flaw: standard LLMs operate as "malloc-only" engines, continuously accumulating valid and redundant steps alike without a mechanism to prune obsolete information. To break this cycle, we propose Free()LM, a model that introduces an intrinsic self-forgetting capability via the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, Free()LM dynamically identifies and prunes useless context chunks, maintaining a compact and noise-free state. Extensive experiments show that Free()LM provides consistent improvements across all model scales (8B to 685B). It achieves a 3.3% average improvement over top-tier reasoning baselines, even establishing a new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale. Most notably, in long-horizon tasks where the standard Qwen3-235B-A22B model suffers a total collapse (0% accuracy), Free()LM restores performance to 50%. Our findings suggest that sustainable intelligence requires the freedom to forget as much as the power to think.

82. 【2602.08028】Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Link: https://arxiv.org/abs/2602.08028

Authors: Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Categories: Computation and Language (cs.CL)

Keywords: recent methods guide, large language models, methods guide large, guide large language, paths in standard

Comments: Accepted to Findings of IJCNLP-AACL 2025

Click to view abstract

Abstract:To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

83. 【2602.08024】FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging

Link: https://arxiv.org/abs/2602.08024

Authors: Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, shown remarkable capabilities

Comments: Accepted by ICLR 2026 (Oral)

Click to view abstract

Abstract:Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at this https URL.

84. 【2602.08009】Towards Adaptive, Scalable, and Robust Coordination of LLM Agents: A Dynamic Ad-Hoc Networking Perspective

Link: https://arxiv.org/abs/2602.08009

Authors: Rui Li, Zeyu Zhang, Xiaohe Bo, Quanyu Dai, Chaozhuo Li, Feng Wen, Xu Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, realize swarm intelligence, Multi-agent architectures built, language models, well-crafted collaboration

Comments

Click to view abstract

Abstract:Multi-agent architectures built on large language models (LLMs) have demonstrated the potential to realize swarm intelligence through well-crafted collaboration. However, the substantial burden of manual orchestration inherently raises an imperative to automate the design of agentic workflows. We frame such an agent coordination challenge as a classic problem in dynamic ad-hoc networking: How to establish adaptive and reliable communication among a scalable number of agentic hosts? In response to this unresolved dilemma, we introduce RAPS, a reputation-aware publish-subscribe paradigm for adaptive, scalable, and robust coordination of LLM agents. RAPS is grounded in the Distributed Publish-Subscribe Protocol, allowing LLM agents to exchange messages based on their declared intents rather than predefined topologies. Beyond this substrate, RAPS further incorporates two coherent overlays: (i) Reactive Subscription, enabling agents to dynamically refine their intents; and (ii) Bayesian Reputation, empowering each agent with a local watchdog to detect and isolate malicious peers. Extensive experiments over five benchmarks showcase that our design effectively reconciles adaptivity, scalability, and robustness in a unified multi-agent coordination framework.

85. 【2602.08005】DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Link: https://arxiv.org/abs/2602.08005

Authors: Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: long-chain reasoning, autonomous agents, applications like autonomous, creative writing, writing is fundamentally

Comments: preprint

Click to view abstract

Abstract:The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2$\times$ throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at this https URL.

86. 【2602.07996】The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation

Link: https://arxiv.org/abs/2602.07996

Authors: Arash Marioriyad, Omid Ghahroodi, Ehsaneddin Asgari, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

Categories: Computation and Language (cs.CL)

Keywords: evaluate system outputs, question answering, evaluate system, system outputs, outputs in tasks

Comments

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as automatic judges to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. A faithful judge should base its verdicts solely on content quality, remain invariant to irrelevant context, and transparently reflect the factors driving its decisions. We test this ideal via controlled cue perturbations (synthetic metadata labels injected into evaluation prompts) for six judge models: GPT-4o, Gemini-2.0-Flash, Gemma-3-27B, Qwen3-235B, Claude-3-Haiku, and Llama3-70B. Experiments span two complementary datasets with distinct evaluation regimes: ELI5 (factual QA) and LitBench (open-ended creative writing). We study six cue families: source, temporal, age, gender, ethnicity, and educational status. Beyond measuring verdict shift rates (VSR), we introduce cue acknowledgment rate (CAR) to quantify whether judges explicitly reference the injected cues in their natural-language rationales. Across cues with strong behavioral effects, e.g., provenance hierarchies (Expert > Human > LLM > Unknown), recency preferences (New > Old), and educational-status favoritism, CAR is typically at or near zero, indicating that shortcut reliance is largely unreported even when it drives decisions. Crucially, CAR is also dataset-dependent: explicit cue recognition is more likely to surface in the factual ELI5 setting for some models and cues, but often collapses in the open-ended LitBench regime, where large verdict shifts can persist despite zero acknowledgment. The combination of substantial verdict sensitivity and limited cue acknowledgment reveals an explanation gap in LLM-as-judge pipelines, raising concerns about reliability of model-based evaluation in both research and deployment.

87. 【2602.07983】Accelerating Social Science Research via Agentic Hypothesization and Experimentation

Link: https://arxiv.org/abs/2602.07983

Authors: Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah, Balaji Krishnamurthy

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Data-driven social science, social science research, hypothesis generation, inherently slow, relying on iterative

Comments

Click to view abstract

Abstract:Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian optimization inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2-4x more statistically significant hypotheses that are 7-17 percent more predictive than prior approaches, and naturally extends to complex data regimes including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conduct an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88 percent were rated moderately or strongly novel, 70 percent were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with p less than 1e-6 and a large effect size of 344 percent.

88. 【2602.07978】Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection

Link: https://arxiv.org/abs/2602.07978

Authors: Rui Feng, Zhiyao Luo, Liuyu Wu, Wei Wang, Yuting Song, Yong Liu, Kok Pin Ng, Jianqing Li, Xingyao Wang

Categories: Computation and Language (cs.CL)

Keywords: Mild Cognitive Impairment, Speech-based digital biomarkers, digital biomarkers represent, Speech-based digital, identification of Mild

Comments: 18 pages, 7 figures, 6 tables

Click to view abstract

Abstract:Speech-based digital biomarkers represent a scalable, non-invasive frontier for the early identification of Mild Cognitive Impairment (MCI). However, the development of robust diagnostic models remains impeded by acute clinical data scarcity and a lack of interpretable reasoning. Current solutions frequently struggle with cross-lingual generalization and fail to provide the transparent rationales essential for clinical trust. To address these barriers, we introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning. Specifically, SynCog simulates diverse virtual subjects with varying cognitive profiles to effectively alleviate clinical data scarcity. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages, effectively bypassing data bottlenecks in low-resource settings and bolstering the diagnostic performance of Multimodal Large Language Models (MLLMs). Leveraging this synthesized dataset, we fine-tune a foundational multimodal backbone using a CoT deduction strategy, empowering the model to explicitly articulate diagnostic thought processes rather than relying on black-box predictions. Extensive experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance, achieving Macro-F1 scores of 80.67% and 78.46%, respectively, outperforming current baseline models. Furthermore, evaluation on an independent real-world Mandarin cohort (CIR-E) demonstrates robust cross-linguistic generalization, attaining a Macro-F1 of 48.71%. These findings constitute a critical step toward providing clinically trustworthy and linguistically inclusive cognitive assessment tools for global healthcare.

89. 【2602.07963】Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms

Link: https://arxiv.org/abs/2602.07963

Authors: Vaibhav Shukla, Hardik Sharma, Adith N Reganti, Soham Wasmatkar, Bagesh Kumar, Vrijendra Singh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remain anchored, English, Abstract, LLMs, safety

Comments: Accepted at the AICS Workshop, AAAI 2026

Click to view abstract

Abstract:Most safety evaluations of large language models (LLMs) remain anchored in English. Translation is often used as a shortcut to probe multilingual behavior, but it rarely captures the full picture, especially when harmful intent or structure morphs across languages. Some types of harm survive translation almost intact, while others distort or disappear. To study this effect, we introduce CompositeHarm, a translation-based benchmark designed to examine how safety alignment holds up as both syntax and semantics shift. It combines two complementary English datasets, AttaQ, which targets structured adversarial attacks, and MMSafetyBench, which covers contextual, real-world harms, and extends them into six languages: English, Hindi, Assamese, Marathi, Kannada, and Gujarati. Using three large models, we find that attack success rates rise sharply in Indic languages, especially under adversarial syntax, while contextual harms transfer more moderately. To ensure scalability and energy efficiency, our study adopts lightweight inference strategies inspired by edge-AI design principles, reducing redundant evaluation passes while preserving cross-lingual fidelity. This design makes large-scale multilingual safety testing both computationally feasible and environmentally conscious. Overall, our results show that translated benchmarks are a necessary first step, but not a sufficient one, toward building grounded, resource-aware, language-adaptive safety systems.

90. 【2602.07954】Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Link: https://arxiv.org/abs/2602.07954

Authors: Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Polish language applications, Large Language Models, Polish language safety, Large Language, Polish language

Comments

Click to view abstract

Abstract:As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.

91. 【2602.07930】Patches of Nonlinearity: Instruction Vectors in Large Language Models

Link: https://arxiv.org/abs/2602.07930

Authors: Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: Direct Preference Optimization, process instructions internally, ubiquitous usage, models process instructions, recent success

Comments

Click to view abstract

Abstract:Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.

92. 【2602.07909】SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

Link: https://arxiv.org/abs/2602.07909

Authors: Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, language models, continue to scale, significantly improved, Candidate Importance Score

Comments: ICLR2026

Click to view abstract

Abstract:As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's $\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at this https URL.
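
Kendall's $\tau$, the ranking metric cited in this abstract, measures how well an estimated model ranking agrees with the ranking from full evaluation. The sketch below computes the tau-a variant in plain Python; whether the paper uses tau-a or a tie-corrected variant is not stated, so treat the choice and the function name `kendall_tau` as assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both sequences order items i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` would hold each model's score on the full benchmark and `y` its score estimated from the selected anchor items; a value near 1 means the cheap estimate preserves the model ordering.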

93. 【2602.07892】Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Link: https://arxiv.org/abs/2602.07892

Authors: Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, reasoning and coding, OGPSA

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA consistently improves the safety-utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT$\rightarrow$DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at this https URL.
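
The projection step described in this abstract can be sketched on toy vectors. This is an illustrative assumption, not the released OGPSA code: the abstract does not say how the low-rank subspace is estimated, so Gram-Schmidt over a few reference gradients stands in for it here, and the names `orthonormal_basis` and `project_out` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the reference gradients."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(gradient, basis):
    """Remove the component of `gradient` that lies in the span of `basis`."""
    g = list(gradient)
    for b in basis:
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g
```

The projected gradient has zero inner product with every reference gradient, so to first order a step along it leaves the loss on the reference set unchanged.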

94. 【2602.07852】Emergent Misalignment is Easy, Narrow Misalignment is Hard

Link: https://arxiv.org/abs/2602.07852

Authors: Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Finetuning large language, diverse unrelated settings, Finetuning large, large language models, narrowly harmful datasets

Comments: Published at ICLR 2026

Abstract:Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically 'evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases and find that models can just learn the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and preliminary metrics for investigating how inductive biases shape generalisation in LLMs. We open-source all code, datasets and model finetunes.

95. 【2602.07842】Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

Link: https://arxiv.org/abs/2602.07842

Authors: Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi

Subjects: Computation and Language (cs.CL)

Keywords: large language models, making large language, existing training-free methods, language models, essential for making

Abstract:Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
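
The failure mode the abstract describes is easy to reproduce with one common training-free estimator, agreement among sampled answers. The sketch below is only an illustration: the abstract does not say whether this exact estimator is among the 15 methods tested, the sample answers are invented, and this is not the proposed SCA method.

```python
from collections import Counter
from typing import List


def consistency_confidence(samples: List[str]) -> float:
    """Share of sampled answers that agree with the most frequent one."""
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


# Single-answer question: samples agree, so confidence is high.
single = ["Paris", "Paris", "paris", "Paris"]

# Multi-answer question ("Name a noble gas"): every sample is correct,
# yet the samples disagree with one another, so the estimate drops.
multi = ["Helium", "Neon", "Argon", "Neon"]

conf_single = consistency_confidence(single)
conf_multi = consistency_confidence(multi)
```

Every answer in `multi` is correct, but the agreement-based estimate is lower than for `single`, which is the kind of underestimation the abstract attributes to disagreement among equally valid responses.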

96. 【2602.07839】TodoEvolve: Learning to Architect Agent Planning Systems

Link: https://arxiv.org/abs/2602.07839

Authors: Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: existing approaches predominantly, approaches predominantly rely, hand-crafted planning structures, navigating complex, rely on fixed

Abstract:Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.

97. 【2602.07833】SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

Link: https://arxiv.org/abs/2602.07833

Authors: Weijiang Lv, Yaoxuan Feng, Xiaobo Xia, Jiayu Wang, Yan Jing, Wenchao Chen, Bo Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, multimodal large language, traces remains unclear, language models, remains unclear

Comments: 53 pages, 42 figures, 14 tables

Abstract:Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a training-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at this https URL.

98. 【2602.07824】Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Link: https://arxiv.org/abs/2602.07824

Authors: Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: quality determines foundation, Data quality determines, foundation model performance, determines foundation model, introduce Data Darwinism

Abstract:Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.

99. 【2602.07812】LLMs Know More About Numbers than They Can Say

Link: https://arxiv.org/abs/2602.07812

Authors: Fengting Yuchi, Li Du, Jason Eisner

Subjects: Computation and Language (cs.CL)

Keywords: solve math problems, math problems, mixed notation, solve math, comparisons with mixed

Comments: EACL 2026

Abstract:Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities.
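
For reference, the comparison quoted in the abstract can be checked directly. The short sketch below also computes the base-10 log-magnitudes, the quantity the abstract says a linear probe recovers. It is plain arithmetic, not the paper's probing code, and the variable names are illustrative.

```python
import math

a = 5.7e2   # 5.7 x 10^2 in scientific notation, i.e. 570
b = 580.0   # plain decimal notation

larger = "580" if b > a else "5.7e2"

# Log-magnitudes of the two numerals; the gap between them is small,
# which hints at why near-ties in mixed notation are a demanding test.
log_a = math.log10(a)
log_b = math.log10(b)
gap = log_b - log_a
```

Here `larger` is `"580"`, and the two log-magnitudes differ by less than 0.01.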

100. 【2602.07804】Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models

Link: https://arxiv.org/abs/2602.07804

Authors: Xuan Ding, Pengyu Tong, Ranjie Duan, Yunjian Zhang, Rui Sun, Yao Zhu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: high computational demands, deployment in real-world, real-world scenarios, constrained by high, large language models

Comments: Accepted by ICLR 2026

Abstract:While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios is still constrained by high computational demands. Layer-wise pruning, a commonly employed strategy to mitigate inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for the interdependencies among layers, thereby limiting the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. As computing exact Shapley values is computationally infeasible for LLMs, we propose using a lightweight surrogate network to estimate layer-wise marginal contributions. This network can predict LLM performance for arbitrary layer combinations at a low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.
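
To make the cooperative-game framing concrete, here is a minimal sketch of Monte Carlo Shapley estimation over layers. It uses plain permutation sampling rather than the stratified mask sampling named in the abstract, and a hand-written `toy_utility` stands in for the surrogate network. Both are assumptions for illustration, not the paper's method.

```python
import random
from typing import Callable, FrozenSet, List


def shapley_permutation_estimate(
    n_layers: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate each layer's Shapley value by averaging its marginal gain
    over random orderings of the layers."""
    rng = random.Random(seed)
    totals = [0.0] * n_layers
    layers = list(range(n_layers))
    for _ in range(n_permutations):
        rng.shuffle(layers)
        kept: set = set()
        prev = utility(frozenset(kept))
        for layer in layers:
            kept.add(layer)
            cur = utility(frozenset(kept))
            totals[layer] += cur - prev
            prev = cur
    return [t / n_permutations for t in totals]


def toy_utility(kept: FrozenSet[int]) -> float:
    """Hypothetical stand-in for model quality given the retained layers.
    Layers 0 and 1 only help together, an interaction that a per-layer
    heuristic would miss; layer 3 contributes nothing."""
    score = 0.0
    if 0 in kept and 1 in kept:
        score += 4.0
    if 2 in kept:
        score += 1.0
    return score


values = shapley_permutation_estimate(4, toy_utility)
prune_order = sorted(range(4), key=lambda i: values[i])  # least important first
```

In this toy game the estimates sum to the utility of the full model, and the layer that never changes the utility is ranked first for pruning.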

101. 【2602.07796】Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents

Link: https://arxiv.org/abs/2602.07796

Authors: Jiatong Li, Changdae Oh, Hyeong Kyu Choi, Jindong Wang, Sharon Li

Subjects: Computation and Language (cs.CL)

Keywords: large language models, Eliciting reasoning, powerful technique, technique for improving, large language

Comments: 27 pages, 19 figures

Abstract:Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, its effectiveness in realistic user-engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user-engaged LLM agents. Our experiments span seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user-engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more "introverted" by shortening responses and reducing information disclosure to users, which weakens agent-user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real-world scenarios. Our code is available at this https URL.

102. 【2602.07794】Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models

Link: https://arxiv.org/abs/2602.07794

Authors: Ningyu Xu, Qi Zhang, Xipeng Qiu, Xuanjing Huang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, exhibit emergent behaviors, emergent behaviors suggestive, Large language, exhibit emergent

Comments: 27 pages, 16 figures

Abstract:Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. While recent work has identified structured, human-like conceptual representations within these models, it remains unclear whether they functionally rely on such representations for reasoning. Here we investigate the internal processing of LLMs during in-context concept inference. Our results reveal a conceptual subspace emerging in middle to late layers, whose representational structure persists across contexts. Using causal mediation analyses, we demonstrate that this subspace is not merely an epiphenomenon but is functionally central to model predictions, establishing its causal role in inference. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which is subsequently leveraged by later layers to generate predictions. Together, these findings provide evidence that LLMs dynamically construct and use structured, latent representations in context for inference, offering insights into the computational processes underlying flexible adaptation.

103. 【2602.07778】Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs

Link: https://arxiv.org/abs/2602.07778

Authors: Shenglai Zeng, Tianqi Zheng, Chuan Tian, Dante Everaert, Yau-Shian Wang, Yupin Huang, Michael J. Morais, Rohit Patki, Jinjin Tian, Xinnan Dai, Kai Guo, Monica Xiao Cheng, Hui Liu

Subjects: Computation and Language (cs.CL)

Keywords: Personalizing large language, high inference latency, large language models, API costs, Personalizing large

Abstract:Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by 50 times.

104. 【2602.07773】SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents

Link: https://arxiv.org/abs/2602.07773

Authors: Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang, Wei Han, Yong Liu

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: complex question answering, Recent deep search, Recent deep, excel at complex, iteratively planning

Abstract:Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.

105. 【2602.07721】ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

Link: https://arxiv.org/abs/2602.07721

Authors: Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Databases (cs.DB)

Keywords: long-context LLM inference, existing methods struggle, LLM inference, Unified Virtual Addressing, long-context LLM

Comments: 25 pages, 16 figures. Under review

Abstract:KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache Top-$k$ retrieval baselines.

106. 【2602.07698】On Sequence-to-Sequence Models for Automated Log Parsing

Link: https://arxiv.org/abs/2602.07698

Authors: Adam Sorrenti, Andriy Miranskyy

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: critical standard operating, standard operating procedure, automated log parsing, Log parsing, enabling monitoring

Abstract:Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
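
The evaluation metric named in the abstract, relative Levenshtein edit distance, can be sketched as follows. The abstract does not state the exact normalization, so dividing by the longer string's length is an assumption here, and the log template strings are invented examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (two-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length
    (assumed normalization; 0.0 means an exact match)."""
    longest = max(len(predicted), len(reference))
    return levenshtein(predicted, reference) / longest if longest else 0.0


# Hypothetical log template and a model output that misses one wildcard.
reference = "Connection from <*> closed after <*> ms"
predicted = "Connection from <*> closed after 35 ms"
score = relative_edit_distance(predicted, reference)
```

Lower is better, matching the direction of the scores reported in the abstract; a parser that reproduces the template exactly scores 0.0.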

107. 【2602.07695】EventCast: Hybrid Demand Forecasting in E-Commerce with LLM-Based Event Knowledge

Link: https://arxiv.org/abs/2602.07695

Authors: Congcong Hu, Yuang Shi, Fan Huang, Yang Xiang, Zhou Ye, Ming Jin, Shiyu Wang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: impacting inventory planning, directly impacting inventory, fulfillment scheduling, impacting inventory, inventory planning

Abstract:Demand forecasting is a cornerstone of e-commerce operations, directly impacting inventory planning and fulfillment scheduling. However, existing forecasting systems often fail during high-impact periods such as flash sales, holiday campaigns, and sudden policy interventions, where demand patterns shift abruptly and unpredictably. In this paper, we introduce EventCast, a modular forecasting framework that integrates future event knowledge into time-series prediction. Unlike prior approaches that ignore future interventions or directly use large language models (LLMs) for numerical forecasting, EventCast leverages LLMs solely for event-driven reasoning. Unstructured business data from existing operational databases, covering campaigns, holiday schedules, and seller incentives, is processed by an LLM that converts it into interpretable textual summaries, drawing on world knowledge for cultural nuances and novel event combinations. These summaries are fused with historical demand features within a dual-tower architecture, enabling accurate, explainable, and scalable forecasts. Deployed on real-world e-commerce scenarios spanning 160 regions across 4 countries over 10 months, EventCast achieves up to 86.9% and 97.7% improvement on MAE and MSE compared to the variant without event knowledge, and reduces MAE by up to 57.0% and MSE by 83.3% versus the best industrial baseline during event-driven periods. EventCast has been deployed in real-world industrial pipelines since March 2025, offering a practical solution for improving operational decision-making in dynamic e-commerce environments.

108. 【2602.07673】Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Link: https://arxiv.org/abs/2602.07673

Authors: Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi

Subjects: Computation and Language (cs.CL)

Keywords: capture semantic information, Large language model, Large language, alongside traditional, semantic information

Abstract:Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases for length and order among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarities (as measured by ROUGE and BLEU) between the judged summaries decrease, and this pattern extends to all but one model tested, and exists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlaps, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.
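
The overlap the abstract refers to is measured with ROUGE and BLEU. As a rough illustration of what such a score captures, the sketch below computes a simplified ROUGE-1-style unigram F1 with whitespace tokenization and no stemming. It is not the official ROUGE implementation, and the example sentences are invented.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: clipped unigram matches between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((cand & ref).values())  # multiset intersection clips counts
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


human = "the council approved the new budget on monday"
llm_close = "the council approved the budget on monday"
llm_far = "officials signed off on next year's spending plan"

close_score = unigram_overlap_f1(llm_close, human)
far_score = unigram_overlap_f1(llm_far, human)
```

The paraphrase in `llm_far` conveys similar content but shares almost no words with the human summary, so its overlap score is much lower than that of `llm_close`. Low-overlap pairs of this kind are where the abstract reports the strongest judge preference for LLM-written summaries.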

109. 【2602.07639】Letting Tutor Personas "Speak Up" for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization

Link: https://arxiv.org/abs/2602.07639

Authors: Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan

Subjects: Computation and Language (cs.CL)

Keywords: generative artificial intelligence, large language models, artificial intelligence, increasingly prominent, emergence of large

Abstract:With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We modify Bidirectional Preference Optimization (BiPO) to learn a steering vector, an activation-space direction that steers model responses towards certain tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned directional coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.

110. 【2602.07621】SciClaimEval: Cross-modal Claim Verification in Scientific Papers

Link: https://arxiv.org/abs/2602.07621

Authors: Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Tian Cheng Xia, Florian Boudin, Andre Greiner-Petter, Akiko Aizawa

Subjects: Computation and Language (cs.CL)

Keywords: claim verification task, present SciClaimEval, verification task, scientific dataset, Unlike existing resources

Comments: 12 pages; data is available at [this https URL](https://sciclaimeval.github.io/)

Abstract:We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains, machine learning, natural language processing, and medicine, validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.

111. 【2602.07594】Learning to Self-Verify Makes Language Models Better Reasoners

Link: https://arxiv.org/abs/2602.07594

Authors: Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Recent large language, Recent large, achieve strong performance, generating promising reasoning, promising reasoning paths

Abstract:Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.

112. 【2602.07574】ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Link: https://arxiv.org/abs/2602.07574

Authors: Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Modern multimodal large, incurring substantial computational, large language models, unified self-attention design, Transformer layer

Abstract:Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at this https URL.

113. 【2602.07549】When Is Enough Not Enough? Illusory Completion in Search Agents

Link: https://arxiv.org/abs/2602.07549

Authors: Dayoon Ko,Jihyuk Kim,Sohyeon Kim,Haeju Park,Dahyun Lee,Gunhee Kim,Moontae Lee,Kyungjae Lee

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent search agents, Recent search, search agents leverage, achieve strong performance, long-horizon benchmarks

Abstract:Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents' beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.

114. 【2602.07546】Improving Variable-Length Generation in Diffusion Language Models via Length Regularization

Link: https://arxiv.org/abs/2602.07546

Authors: Zicong Cheng,Ruixuan Jia,Jia Li,Guo-Wei Yang,Meng-Hao Guo,Shi-Min Hu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Diffusion Large Language, Large Language Models, Diffusion Large, Language Models, Large Language

Comments: diffusion language models

Abstract:Diffusion Large Language Models (DLLMs) are inherently ill-suited for variable-length generation, as their inference is defined on a fixed-length canvas and implicitly assumes a known target length. When the length is unknown, as in realistic completion and infilling, naively comparing confidence across mask lengths becomes systematically biased, leading to under-generation or redundant continuations. In this paper, we show that this failure arises from an intrinsic length-induced bias in generation confidence estimates, leaving existing DLLMs without a robust way to determine generation length and making variable-length inference unreliable. To address this issue, we propose LR-DLLM, a length-regularized inference framework for DLLMs that treats generation length as an explicit variable and achieves reliable length determination at inference time. It decouples semantic compatibility from length-induced uncertainty through an explicit length regularization that corrects biased confidence estimates. Based on this, LR-DLLM enables dynamic expansion or contraction of the generation span without modifying the underlying DLLM or its training procedure. Experiments show that LR-DLLM achieves 51.3% Pass@1 on HumanEval-Infilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn).

115. 【2602.07517】MemPot: Defending Against Memory Extraction Attack with Optimized Honeypots

Link: https://arxiv.org/abs/2602.07517

Authors: Yuhao Wang,Shengfang Zhai,Guanghao Jin,Yinpeng Dong,Linyi Yang,Jiaheng Zhang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)

Keywords: Large Language Model, Large Language, defenses remain lacking, effective defenses remain, internal memory systems

Abstract:Large Language Model (LLM)-based agents employ external and internal memory systems to handle complex, goal-oriented tasks, yet this exposes them to severe extraction attacks, and effective defenses remain lacking. In this paper, we propose MemPot, the first theoretically verified defense framework against memory extraction attacks by injecting optimized honeypots into the memory. Through a two-stage optimization process, MemPot generates trap documents that maximize the retrieval probability for attackers while remaining inconspicuous to benign users. We model the detection process as Wald's Sequential Probability Ratio Test (SPRT) and theoretically prove that MemPot achieves a lower average number of sampling rounds compared to optimal static detectors. Empirically, MemPot significantly outperforms state-of-the-art baselines, achieving a 50% improvement in detection AUROC and an 80% increase in True Positive Rate under low False Positive Rate constraints. Furthermore, our experiments confirm that MemPot incurs zero additional online inference latency and preserves the agent's utility on standard tasks, verifying its superiority in safety, harmlessness, and efficiency.
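
For readers unfamiliar with Wald's Sequential Probability Ratio Test named in the abstract, the following is a minimal sketch of the generic test on Bernoulli observations. It is not MemPot's detector: the function name `sprt_bernoulli`, the hit probabilities `p0` and `p1`, the error targets, and the framing of each query as "retrieved a trap document or not" are all illustrative assumptions.

```python
import math
from typing import Iterable, Tuple


def sprt_bernoulli(
    observations: Iterable[int],
    p0: float,
    p1: float,
    alpha: float = 0.05,
    beta: float = 0.05,
) -> Tuple[str, int]:
    """Generic Wald SPRT for H0: p = p0 versus H1: p = p1 (assumes p1 > p0).

    Each observation is 1 (hypothetically: the query retrieved a trap
    document) or 0 (it did not). Returns the decision and the number of
    observations consumed before stopping.
    """
    upper = math.log((1 - beta) / alpha)  # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing this accepts H0
    llr = 0.0  # running log-likelihood ratio
    n = 0
    for x in observations:
        n += 1
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "undecided", n


# Illustrative numbers only: benign users hit a trap 5% of the time,
# attackers 50% of the time.
print(sprt_bernoulli([1, 1], p0=0.05, p1=0.5))    # ('accept_h1', 2)
print(sprt_bernoulli([0] * 10, p0=0.05, p1=0.5))  # ('accept_h0', 5)
```

The point of the sequential form is that the test stops as soon as the evidence is strong enough, so clear-cut cases need only a handful of observations.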

116. 【2602.07499】Let's Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification

Link: https://arxiv.org/abs/2602.07499

Authors: Jingshen Zhang,Xin Ying Qiu,Lifang Lu,Zhuhua Huang,Yutao Hu,Yuechang Wu,JunYu Lu

Categories: Computation and Language (cs.CL)

Keywords: large readability levels, models demonstrate limited, demonstrate limited capability, Large language models, language models demonstrate

Comments: Accepted to EACL 2026 Findings

Abstract:Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability levels. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.

117. 【2602.07497】From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection

Link: https://arxiv.org/abs/2602.07497

Authors: Mo Wang,Kaixuan Ren,Pratik Jalan,Ahmed Ashraf,Tuong Vy Vu,Rahul Seetharaman,Shah Nawaz,Usman Naseem

Categories: Computation and Language (cs.CL)

Keywords: Cultural context profoundly, interpret online content, remain predominantly trained, context profoundly shapes, people interpret online

Comments: 12 pages, 5 figures, Proceedings of the ACM Web Conference 2026 (WWW '26)

Abstract:Cultural context profoundly shapes how people interpret online content, yet vision-language models (VLMs) remain predominantly trained through Western or English-centric lenses. This limits their fairness and cross-cultural robustness in tasks like hateful meme detection. We introduce a systematic evaluation framework designed to diagnose and quantify the cross-cultural robustness of state-of-the-art VLMs across multilingual meme datasets, analyzing three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) translation effects on meaning and detection. Results show that the common "translate-then-detect" approach deteriorates performance, while culturally aligned interventions (native-language prompting and one-shot learning) significantly enhance detection. Our findings reveal systematic convergence toward Western safety norms and provide actionable strategies to mitigate such bias, guiding the design of globally robust multimodal moderation systems.

118. 【2602.07465】On the Importance of a Multi-Scale Calibration for Quantization

Link: https://arxiv.org/abs/2602.07465

Authors: Seungwoo Son,Ingyu Seong,Junhan Kim,Hyemi Jang,Yongkweon Jeon

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, efficiently deploying large, deploying large language, set critically affects, small calibration set

Comments: ICASSP 2026

Abstract:Post-training quantization (PTQ) is a cornerstone for efficiently deploying large language models (LLMs), where a small calibration set critically affects quantization performance. However, conventional practices rely on random sequences of fixed length, overlooking the variable-length nature of LLM inputs. Input length directly influences the activation distribution and, consequently, the weight importance captured by the Hessian, which in turn affects quantization outcomes. As a result, Hessian estimates derived from fixed-length calibration may fail to represent the true importance of weights across diverse input scenarios. We propose MaCa (Matryoshka Calibration), a simple yet effective method for length-aware Hessian construction. MaCa (i) incorporates multi-scale sequence length information into Hessian estimation and (ii) regularizes each sequence as an independent sample, yielding a more stable and fruitful Hessian for accurate quantization. Experiments on state-of-the-art LLMs (e.g., Qwen3, Gemma3, LLaMA3) demonstrate that MaCa consistently improves accuracy under low bit quantization, offering a lightweight enhancement compatible with existing PTQ frameworks. To the best of our knowledge, this is the first work to systematically highlight the role of multi-scale calibration in LLM quantization.

119. 【2602.07464】SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

Link: https://arxiv.org/abs/2602.07464

Authors: Yijie Chen,Yijin Liu,Fandong Meng

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement Learning, large language models, Supervised Fine-Tuning, standard post-training paradigm, post-training paradigm

Comments: The code is publicly available at [this https URL](https://github.com/pppa2019/SED-SFT)

Abstract:Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at this https URL

120. 【2602.07457】Pull Requests as a Training Signal for Repo-Level Code Editing

Link: https://arxiv.org/abs/2602.07457

Authors: Qinglin Zhu,Tianyu Chen,Shuai Lu,Lei Ji,Runcong Zhao,Murong Ma,Xiangxiang Dai,Yulan He,Lin Gui,Peng cheng,Yeyun Gong

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: execute precise multi-file, precise multi-file modifications, understand complex dependencies, large codebase, dependencies and execute

Abstract:Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with error-driven data augmentation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.
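
To make the "Search/Replace edit block" format mentioned in the abstract concrete, here is a minimal sketch of how such an edit could be applied to a file's contents. The helper `apply_search_replace` and its rule that the search text must match exactly once are assumptions for illustration; the abstract does not describe the paper's actual reconstruction and validation pipeline.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one Search/Replace edit to a file's text.

    Assumed validation rule (hypothetical, not from the paper): the search
    text must be non-empty and occur exactly once, so the edit is
    unambiguous.
    """
    if not search:
        raise ValueError("search text must be non-empty")
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(search, replace, 1)


original = "def area(r):\n    return 3.14 * r * r\n"
patched = apply_search_replace(
    original,
    search="    return 3.14 * r * r\n",
    replace="    return math.pi * r * r\n",
)
print(patched)
```

Rejecting edits whose search text is missing or ambiguous is one simple way to filter noisy diffs, since an edit that cannot be located cannot be replayed reliably.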

121. 【2602.07451】DLLM Agent: See Farther, Run Faster

Link: https://arxiv.org/abs/2602.07451

Authors: Huiling Zhen,Weizhe Lin,Renxi Liu,Kai Han,Yiming Li,Yuchuan Tian,Hanting Chen,Xiaoguang Li,Xiaosong Li,Chen Chen,Xianzhi Yu,Mingxuan Yuan,Youliang Yan,Peifeng Qin,Jun Wang,Yu Wang,Dacheng Tao,Yunhe Wang

Categories: Computation and Language (cs.CL)

Keywords: large language models, making remain underexplored, agentic multi-step decision, multi-step decision making, decision making remain

Abstract:Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.

122. 【2602.07447】Measuring cross-language intelligibility between Romance languages with computational tools

Link: https://arxiv.org/abs/2602.07447

Authors: Liviu P Dinu,Ana Sabina Uban,Bogdan Iordache,Anca Dinu,Simona Georgescu

Categories: Computation and Language (cs.CL)

Keywords: Romance family, related languages applied, main Romance languages, present an analysis, measure mutual intelligibility

Comments: 16 pages, 7 figures, 2 tables

Abstract:We present an analysis of mutual intelligibility in related languages, applied to languages in the Romance family. We introduce a novel computational metric for estimating intelligibility based on lexical similarity using surface and semantic similarity of related words, and use it to measure mutual intelligibility for the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian). We compare results using both the orthographic and phonetic forms of words as well as different parallel corpora and vectorial models of word meaning representation. The obtained intelligibility scores confirm intuitions related to intelligibility asymmetry across languages and significantly correlate with results of cloze tests in human experiments.
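
As a rough illustration of what "surface similarity of related words" can mean, the sketch below computes a normalized edit-distance similarity between two orthographic forms. This is a common textbook measure chosen for illustration, not the metric proposed in the paper, which the abstract does not specify; the function names and the word pair are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Spanish "noche" and Italian "notte" (both meaning "night") differ by two
# substitutions.
print(levenshtein("noche", "notte"))
print(round(surface_similarity("noche", "notte"), 2))
```

A symmetric measure like this cannot by itself capture the intelligibility asymmetry the abstract refers to, so it should be read only as the surface-form ingredient of a fuller metric.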

123. 【2602.07425】Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Link: https://arxiv.org/abs/2602.07425

Authors: Dingzhi Yu,Hongyi Tao,Yuanyu Wan,Luo Luo,Lijun Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)

Keywords: modern machine learning, recently demonstrated superior, training large language, demonstrated superior empirical, superior empirical performance

Comments: Code available at [this https URL](https://github.com/Dingzhen230/Heavy-tailed-Noise-in-LLMs)

Abstract:While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
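
The intuition for why sign-based updates cope with heavy-tailed gradient noise can be seen in a toy comparison. The sketch below contrasts a plain SGD step with a SignSGD step when one gradient coordinate is a large outlier. It illustrates the basic SignSGD update only, not Lion, Muon, or the paper's analysis; the function names and numbers are illustrative assumptions.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD: the step size scales with the gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]


def signsgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """SignSGD: every coordinate moves by exactly lr, whatever the magnitude."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]


params = [1.0, -2.0]
grads = [1e6, -1e-3]  # first coordinate is a heavy-tailed outlier

print(sgd_step(params, grads, lr=0.1))      # first coordinate is thrown far away
print(signsgd_step(params, grads, lr=0.1))  # both coordinates move by 0.1
```

Because the sign discards magnitude, a single extreme gradient sample cannot produce an arbitrarily large step, which is the property the heavy-tailed analysis builds on.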

124. 【2602.07422】Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

Link: https://arxiv.org/abs/2602.07422

Authors: Tianyi Wu,Mingzhe Du,Yue Liu,Chengran Yang,Terry Yue Zhuo,Jiaheng Zhang,See-Kiong Ng

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, insecure code remains, software development, real-world deployment

Abstract:Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. Existing secure code alignment methods often suffer from a functionality-security paradox, improving security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality-grounded vulnerability-inducing coding tasks for online RL rollouts, and (ii) training a reasoning-based vulnerability reward model that provides scalable and reliable security supervision. Together, these components are unified in an online RL loop to align code LLMs to generate secure and functional code. Extensive experiments demonstrate that SecCoderX achieves state-of-the-art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14-54%. We release our code, dataset and model checkpoints at this https URL.

125. 【2602.07414】Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution

Link: https://arxiv.org/abs/2602.07414

Authors: Deuksin Kwon,Kaleen Shrestha,Bin Han,Spencer Lin,James Hale,Jonathan Gratch,Maja Matarić,Gale M. Lucas

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, language models, legal mediation, simulate human behavior

Comments: AAAI 2026 (Special Track: AISI)

Abstract:Large language models (LLMs) are increasingly used to simulate human behavior in social settings such as legal mediation, negotiation, and dispute resolution. However, it remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans. Human personality, for instance, shapes how individuals navigate social interactions, including strategic choices and behaviors in emotionally charged interactions. This raises the question: Can LLMs, when prompted with personality traits, reproduce personality-driven differences in human conflict behavior? To explore this, we introduce an evaluation framework that enables direct comparison of human-human and LLM-LLM behaviors in dispute resolution dialogues with respect to Big Five Inventory (BFI) personality traits. This framework provides a set of interpretable metrics related to strategic behavior and conflict outcomes. We additionally contribute a novel dataset creation methodology for LLM dispute resolution dialogues with matched scenarios and personality traits with respect to human conversations. Finally, we demonstrate the use of our evaluation framework with three contemporary closed-source LLMs and show significant divergences in how personality manifests in conflict across different LLMs compared to human data, challenging the assumption that personality-prompted agents can serve as reliable behavioral proxies in socially impactful applications. Our work highlights the need for psychological grounding and validation in AI simulations before real-world use.

126. 【2602.07382】Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi

Link: https://arxiv.org/abs/2602.07382

Authors: Debtanu Datta,Rajdeep Mukherjee,Adrijit Goswami,Saptarshi Ghosh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Summarizing Indian legal, Summarizing Indian, Indian legal court, legal court judgments, Indian legal

Comments: 19 pages, 5 figures, 8 tables

Abstract:Summarizing Indian legal court judgments is a complex task not only due to the intricate language and unstructured nature of the legal texts, but also since a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated through domain experts, demonstrating the effectiveness of our approaches.

127. 【2602.07381】When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Link: https://arxiv.org/abs/2602.07381

Authors: Gautam Siddharth Kashyap,Mark Dras,Usman Naseem

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, human values-being helpful, values-being helpful

Comments: Accepted at EACL Mains 2026

Abstract:Aligning Large Language Models (LLMs) with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.

128. 【2602.07376】Do Large Language Models Reflect Demographic Pluralism in Safety?

Link: https://arxiv.org/abs/2602.07376

Authors: Usman Naseem,Gautam Siddharth Kashyap,Sushant Kumar Ray,Rafiq Ali,Ebad Shabbir,Abdullah Mohammad

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Language Model, cultural expectations, moral norms

Comments: Accepted at EACL Findings 2026

Abstract:Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust.
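
Since the abstract relies on SimHash-based deduplication without describing it, here is a minimal sketch of the general technique. The whitespace tokenization, the MD5-derived 64-bit token hashes, and the `max_distance` threshold are assumptions made for illustration and are not taken from the paper.

```python
import hashlib
from typing import List

BITS = 64  # fingerprint width; matches the 8 hash bytes used below


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over whitespace tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts: List[str], max_distance: int = 3) -> List[str]:
    """Keep a text only if its fingerprint is far from every kept one."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for text in texts:
        fp = simhash(text)
        if all(hamming_distance(fp, other) > max_distance for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept


# Texts that differ only in case and spacing share a fingerprint here.
print(deduplicate(["How do I reset a password", "how do i  reset a PASSWORD"]))
```

Near-duplicate texts share most tokens, so their fingerprints differ in only a few bits, which is what makes a small Hamming threshold a cheap duplicate test.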

129. 【2602.07375】Efficient Post-Training Pruning of Large Language Models with Statistical Correction

Link: https://arxiv.org/abs/2602.07375

Authors: Peiqi Yu,Jinhao Wang,Xinyi Sui,Nam Ling,Wei Wang,Wei Jiang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, size and inference, large language, face a trade-off, Post-training pruning

Comments: 11 pages, 2 figures, 5 tables

Abstract:Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.

130. 【2602.07374】TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling

Link: https://arxiv.org/abs/2602.07374

Authors: Nisharg Nargund,Priyesh Shukla

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: achieve remarkable performance, substantial computational resources, Large language models, demand substantial computational, Large language

Abstract:Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1-bit ternary quantization {-1, 0, +1} during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer-wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at this https URL.
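
To show what mapping weights to {-1, 0, +1} with a per-layer scale looks like, the sketch below ternarizes one layer's weights using a mean-absolute-value scale. The absmean scaling rule and the function names are assumptions for illustration; the abstract does not say how TernaryLM computes its adaptive scaling factors, and the straight-through estimator used during training is only noted in a comment, not implemented.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Quantize one layer's weights to codes in {-1, 0, +1} plus a scale.

    Assumed rule: scale = mean(|w|); code = clip(round(w / scale), -1, 1).
    In training, a straight-through estimator would treat the rounding as
    the identity in the backward pass; that part is omitted here.
    """
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


layer = [0.9, -0.05, 0.4, -1.2]
codes, scale = ternarize(layer)
print(codes)  # [1, 0, 1, -1]
print(dequantize(codes, scale))
```

Small weights collapse to 0 and large ones saturate at plus or minus the scale, so each layer stores only the ternary codes and a single floating-point factor.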

131. 【2602.07361】ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations

链接https://arxiv.org/abs/2602.07361

作者:Long S. T. Nguyen,Quan M. Bui,Tin T. Ngo,Quynh T. N. Vo,Dung N. H. Le,Tho T. Quan

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Question Answering, legally interdependent texts, Vietnamese healthcare regulatory, Vietnamese Healthcare, inherently challenging due

备注: Accepted at ACIIDS 2026

Abstract:Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at this https URL.

132. 【2602.07338】Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

链接https://arxiv.org/abs/2602.07338

作者:Geng Liu,Fei Zhu,Rong Feng,Changyi Ma,Shiqi Wang,Gaofeng Meng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, paradigm for Large, predominant interaction paradigm, Language Models

备注

Abstract:Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed "Lost in Conversation" (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator-Assistant architecture. By utilizing an experience-driven Mediator to explicate user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.

133. 【2602.07333】High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning

链接https://arxiv.org/abs/2602.07333

作者:Rajat Arora,Ye Tao,Jianqiang Shen,Ping Liu,Muchen Wu,Qianqi Shen,Benjamin Le,Fedor Borisyuk,Jingwei Wu,Wenjing Zhang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:search activity logs, Large Language Models, Effective personalization, including profiles, professional data

备注

Abstract:Effective personalization on large-scale job platforms requires modeling members based on heterogeneous textual sources, including profiles, professional data, and search activity logs. As recommender systems increasingly adopt Large Language Models (LLMs), creating unified, interpretable, and concise representations from heterogeneous sources becomes critical, especially for latency-sensitive online environments. In this work, we propose a novel Reinforcement Learning (RL) framework to synthesize a unified textual representation for each member. Our approach leverages implicit user engagement signals (e.g., clicks, applies) as the primary reward to distill salient information. Additionally, the framework is complemented by rule-based rewards that enforce formatting and length constraints. Extensive offline experiments across multiple LinkedIn products, one of the world's largest job platforms, demonstrate significant improvements in key downstream business metrics. This work provides a practical, labeling-free, and scalable solution for constructing interpretable user representations that are directly compatible with LLM-based systems.

134. 【2602.07319】Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

链接https://arxiv.org/abs/2602.07319

作者:Savan Doshi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:medical question answering, Large language models, Large language, question answering, outputs can vary

备注

Abstract:Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.

135. 【2602.07276】Steer2Adapt: Dynamically Composing Steering Vectors Elicits Efficient Adaptation of LLMs

链接https://arxiv.org/abs/2602.07276

作者:Pengrui Han,Xueqiang Xu,Keyang Xuan,Peiyang Song,Siru Ouyang,Runchu Tian,Yuqing Jiang,Cheng Qian,Pengcheng Jiang,Jiashuo Sun,Junxia Cui,Ming Zhong,Ge Liu,Jiawei Han,Jiaxuan You

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:efficiently adapting large, adapting large language, large language models, Activation steering, downstream behaviors

备注

Abstract:Activation steering has emerged as a promising approach for efficiently adapting large language models (LLMs) to downstream behaviors. However, most existing steering methods rely on a single static direction per task or concept, making them inflexible under task variation and inadequate for complex tasks that require multiple coordinated capabilities. To address this limitation, we propose STEER2ADAPT, a lightweight framework that adapts LLMs by composing steering vectors rather than learning new ones from scratch. In many domains (e.g., reasoning or safety), tasks share a small set of underlying concept dimensions. STEER2ADAPT captures these dimensions as a reusable, low-dimensional semantic prior subspace, and adapts to new tasks by dynamically discovering a linear combination of basis vectors from only a handful of examples. Experiments across 9 tasks and 3 models in both reasoning and safety domains demonstrate the effectiveness of STEER2ADAPT, achieving an average improvement of 8.2%. Extensive analyses further show that STEER2ADAPT is a data-efficient, stable, and transparent inference-time adaptation method for LLMs.

136. 【2602.07267】BRIDGE: Predicting Human Task Completion Time From Model Performance

链接https://arxiv.org/abs/2602.07267

作者:Fengyuan Liu,Jay Gala,Nilaksh,Dzmitry Bahdanau,Siva Reddy,Hugo Larochelle

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:systems requires grounding, Evaluating the real-world, task completion time, human task completion, requires grounding benchmark

备注

Abstract:Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.

137. 【2602.07253】From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

链接https://arxiv.org/abs/2602.07253

作者:Litian Liu,Reza Pourreza,Yubing Jian,Yao Qin,Roland Memisevic

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:critical open problem, Detecting hallucinations, critical open, significant implications, large language models

备注

Abstract:Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.

138. 【2602.07211】Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

链接https://arxiv.org/abs/2602.07211

作者:Ju Lin,Jing Pan,Ruizhi Li,Ming Sun,Yuzong Liu,Alaa Hassan,Jing Zheng,Florian Metze

类目:Computation and Language (cs.CL); Sound (cs.SD)

关键词:large language models, prompting large language, Recent studies, speech understanding capabilities, audio encodings enables

备注

Abstract:Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation into how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.

139. 【2602.07190】Long-Context Long-Form Question Answering for Legal Domain

链接https://arxiv.org/abs/2602.07190

作者:Anagha Kulkarni,Parin Rajesh Jhaveri,Prasha Shrestha,Yu Tong Han,Reza Amini,Behrouz Madahian

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:involving multiple nested, specialized linguistic devices, layouts involving multiple, multiple nested sections, Legal documents

备注: EACL 2026

Abstract:Legal documents have complex layouts involving multiple nested sections and lengthy footnotes, and they use specialized linguistic devices such as intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, particularly when the answer to the question spans several pages (i.e., requires long context) and is required to be comprehensive (i.e., a long-form answer). In this paper, we address the challenges of long-context question answering in the context of long-form answers, given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, and (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies performance into recall-based coverage categories, allowing human users to evaluate recall with ease. We curate a QA dataset by leveraging the expertise of professionals from fields such as law and corporate tax. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.

140. 【2602.07182】Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors

链接https://arxiv.org/abs/2602.07182

作者:Maximilian Vierlboeck,Antonio Pugliese,Roshanak Nilchian,Paul Grogan,Rashika Sugganahalli Natesh Babu

类目:Software Engineering (cs.SE); Computation and Language (cs.CL)

关键词:driving cost overruns, outright project failures, engineered systems presents, schedule delays, cost overruns

备注: 16 pages, 3 figures, 5 tables

Abstract:Complexity in engineered systems presents one of the most persistent challenges in modern development since it is driving cost overruns, schedule delays, and outright project failures. Yet while architectural complexity has been studied, the structural complexity embedded within requirements specifications remains poorly understood and inadequately quantified. This gap is consequential: requirements fundamentally drive system design, and complexity introduced at this stage propagates through architecture, implementation, and integration. To address this gap, we build on Natural Language Processing methods that extract structural networks from textual requirements. Using these extracted structures, we conducted a controlled experiment employing molecular integration tasks as structurally isomorphic proxies for requirements integration - leveraging the topological equivalence between molecular graphs and requirement networks while eliminating confounding factors such as domain expertise and semantic ambiguity. Our results demonstrate that spectral measures predict integration effort with correlations exceeding 0.95, while structural metrics achieve correlations above 0.89. Notably, density-based metrics show no significant predictive validity. These findings indicate that eigenvalue-derived measures capture cognitive and effort dimensions that simpler connectivity metrics cannot. As a result, this research bridges a critical methodological gap between architectural complexity analysis and requirements engineering practice, providing a validated foundation for applying these metrics to requirements engineering, where similar structural complexity patterns may predict integration effort.

141. 【2602.07181】Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs

链接https://arxiv.org/abs/2602.07181

作者:Tianyu Zhao,Siqi Li,Yasser Shoukry,Salma Elmalaki

类目:Computation and Language (cs.CL)

关键词:personalize Large Language, Large Language Model, Large Language, personalize Large, generation remains under-explored

备注

Abstract:User preferences are increasingly used to personalize Large Language Model (LLM) responses, yet how to reliably leverage preference signals for answer generation remains under-explored. In practice, preferences can be noisy, incomplete, or even misleading, which can degrade answer quality when applied naively. Motivated by the observation that stable personality traits shape everyday preferences, we study personality as a principled "latent" signal behind preference statements. Through extensive experiments, we find that conditioning on personality-aligned preferences substantially improves personalized question answering: selecting preferences consistent with a user's inferred personality increases answer-choice accuracy from 29.25% to 76%, compared to using randomly selected preferences. Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g., travel, movies, education), annotated with Big-Five (OCEAN) trait directions. Finally, we propose a framework that enables an LLM to automatically retrieve personality-aligned preferences and incorporate them during answer generation.

142. 【2602.07179】An Information-Theoretic Framework for Comparing Voice and Text Explainability

链接https://arxiv.org/abs/2602.07179

作者:Mona Rajhans,Vishal Khawarey

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Theory (cs.IT)

关键词:Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, make machine learning, current approaches communicate

备注: Accepted for publication at the 10th ACM International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI 2026), April 24-26, Cebu City, Philippines

Abstract:Explainable Artificial Intelligence (XAI) aims to make machine learning models transparent and trustworthy, yet most current approaches communicate explanations visually or through text. This paper introduces an information-theoretic framework for analyzing how explanation modality (specifically, voice versus text) affects user comprehension and trust calibration in AI systems. The proposed model treats explanation delivery as a communication channel between model and user, characterized by metrics for information retention, comprehension efficiency (CE), and trust calibration error (TCE). A simulation framework implemented in Python was developed to evaluate these metrics using synthetic SHAP-based feature attributions across multiple modality-style configurations (brief, detailed, and analogy-based). Results demonstrate that text explanations achieve higher comprehension efficiency, while voice explanations yield improved trust calibration, with analogy-based delivery achieving the best overall trade-off. This framework provides a reproducible foundation for designing and benchmarking multimodal explainability systems and can be extended to empirical studies using real SHAP or LIME outputs on open datasets such as the UCI Credit Approval or Kaggle Financial Transactions datasets.

143. 【2602.07176】Open TutorAI: An Open-source Platform for Personalized and Immersive Learning with Generative AI

链接https://arxiv.org/abs/2602.07176

作者:Mohamed El Hajji,Tarek Ait Baha,Aicha Dakir,Hammou Fadili,Youssef Es-Saady

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)

关键词:Recent advances, education more scalable, advances in artificial, artificial intelligence, intelligence have created

备注: 19 pages, 15 figures

Abstract:Recent advances in artificial intelligence have created new possibilities for making education more scalable, adaptive, and learner-centered. However, existing educational chatbot systems often lack contextual adaptability, real-time responsiveness, and pedagogical agility, which can limit learner engagement and diminish instructional effectiveness. Thus, there is a growing need for open, integrative platforms that combine AI and immersive technologies to support personalized, meaningful learning experiences. This paper presents Open TutorAI, an open-source educational platform based on LLMs and generative technologies that provides dynamic, personalized tutoring. The system integrates natural language processing with customizable 3D avatars to enable multimodal learner interaction. Through a structured onboarding process, it captures each learner's goals and preferences in order to configure a learner-specific AI assistant. This assistant is accessible via both text-based and avatar-driven interfaces. The platform includes tools for organizing content, providing embedded feedback, and offering dedicated interfaces for learners, educators, and parents. This work focuses on learner-facing components, delivering a tool for adaptive support that responds to individual learner profiles without requiring technical expertise. Its assistant-generation pipeline and avatar integration enhance engagement and emotional presence, creating a more humanized, immersive learning environment. Embedded learning analytics support self-regulated learning by tracking engagement patterns and generating actionable feedback. The result is Open TutorAI, which unites modular architecture, generative AI, and learner analytics within an open-source framework. It contributes to the development of next-generation intelligent tutoring systems.

144. 【2602.07164】Your Language Model Secretly Contains Personality Subnetworks

链接https://arxiv.org/abs/2602.07164

作者:Ruimeng Ye,Zihan Wang,Zinan Ling,Yang Xiao,Manling Li,Xiaolong Ma,Bo Hui

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Humans shift, depending on social, Humans, parameter space, personas

备注: ICLR 2026

Abstract:Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters? In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on these findings, we further ask: how can we discover opposing subnetworks in the model that lead to binary-opposing personas, such as introvert-extrovert? To further enhance separation in binary opposition scenarios, we introduce a contrastive pruning strategy that identifies parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge while being more efficient. Our findings suggest that diverse human-like behaviors are not merely induced in LLMs, but are already embedded in their parameter space, pointing toward a new perspective on controllable and interpretable personalization in large language models.

145. 【2602.07160】Free Energy Mixer

链接https://arxiv.org/abs/2602.07160

作者:Jiecheng Lu,Shihao Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:per-head convex average, blocking channel-wise selection, Free Energy Mixer, Standard attention stores, attention stores keys

备注: Camera-ready version. Accepted at ICLR 2026

Abstract:Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.

146. 【2602.07145】Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate

链接https://arxiv.org/abs/2602.07145

作者:Zhiqi Bu,Shiyun Xu,Jialin Mao

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)

关键词:non-convex loss landscape, hard to analyze, Deep learning, learning, optimization dynamics

备注: Part of a planned series to understand and leverage the convexity in deep learning. Accepted to ICLR 2026

Abstract:Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via the learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80X across training horizons and 70X across model sizes.

147. 【2602.07143】Massive Sound Embedding Benchmark (MSEB)

链接https://arxiv.org/abs/2602.07143

作者:Georg Heigold,Ehsan Variani,Tom Bagby,Cyril Allauzen,Ji Ma,Shankar Kumar,Michael Riley

类目:Sound (cs.SD); Computation and Language (cs.CL)

关键词:demonstrate a wide, wide range, Sound Embedding Benchmark, Massive Sound Embedding, Simple Voice Questions

备注

Abstract:Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headroom, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted on GitHub.

148. 【2602.07120】Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model

链接https://arxiv.org/abs/2602.07120

作者:Jacqueline He,Jonathan Hayase,Wen-tau Yih,Sewoong Oh,Luke Zettlemoyer,Pang Wei Koh

类目:Computation and Language (cs.CL)

关键词:Modern language models, emit verbatim spans, Modern language, Anchored Decoding, tend to memorize

备注: 51 pages, 12 figures, 16 tables. Code is publicly available at [this https URL](https://github.com/jacqueline-he/anchored-decoding)

Abstract:Modern language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). We evaluate our methods across six model pairs on long-form evaluations of copyright risk and utility. Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while eliminating up to 75% of the measurable copying gap (averaged over six copying metrics) between the risky baseline and a safe reference, at a modest inference overhead.

149. 【2602.07106】Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

链接https://arxiv.org/abs/2602.07106

作者:Haoyu Zhang,Zhipeng Li,Yiwen Guo,Tianshu Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, unify multimodal understanding, remains largely unexplored, Omni-modal large language, animation remains largely

备注

点击查看摘要

Abstract:Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset that aims to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable aligned speech and facial animation generation.

150. 【2602.07086】Evaluating Retrieval-Augmented Generation Variants for Natural Language-Based SQL and API Call Generation

链接https://arxiv.org/abs/2602.07086

作者:Michael Marketsmüller,Simon Martin,Tim Schlippe

类目:Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:systems increasingly require, translate user requests, REST API calls, REST API call, Enterprise systems increasingly

备注: preprint of conference submission

点击查看摘要

Abstract:Enterprise systems increasingly require natural language interfaces that can translate user requests into structured operations such as SQL queries and REST API calls. While large language models (LLMs) show promise for code generation [Chen et al., 2021; Huynh and Lin, 2025], their effectiveness in domain-specific enterprise contexts remains underexplored, particularly when both retrieval and modification tasks must be handled jointly. This paper presents a comprehensive evaluation of three retrieval-augmented generation (RAG) variants [Lewis et al., 2021] -- standard RAG, Self-RAG [Asai et al., 2024], and CoRAG [Wang et al., 2025] -- across SQL query generation, REST API call generation, and a combined task requiring dynamic task classification. Using SAP Transactional Banking as a realistic enterprise use case, we construct a novel test dataset covering both modalities and evaluate 18 experimental configurations under database-only, API-only, and hybrid documentation contexts. Results demonstrate that RAG is essential: Without retrieval, exact match accuracy is 0% across all tasks, whereas retrieval yields substantial gains in execution accuracy (up to 79.30%) and component match accuracy (up to 78.86%). Critically, CoRAG proves most robust in hybrid documentation settings, achieving statistically significant improvements in the combined task (10.29% exact match vs. 7.45% for standard RAG), driven primarily by superior SQL generation performance (15.32% vs. 11.56%). Our findings establish retrieval-policy design as a key determinant of production-grade natural language interfaces, showing that iterative query decomposition outperforms both top-k retrieval and binary relevance filtering under documentation heterogeneity.

151. 【2602.07079】Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark

链接https://arxiv.org/abs/2602.07079

作者:Go Frendi Gunawan,Mukhlis Amien

类目:Software Engineering (cs.SE); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, activities remain limited, demonstrated remarkable capabilities, comprehensive benchmarks covering

备注: 10 pages, 7 figures. Under review. Code and data will be fully released

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering, yet comprehensive benchmarks covering diverse SE activities remain limited. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. Our automated verification framework measures both output quality and completion efficiency. Key findings reveal that (1) models achieving identical perfect scores exhibit 22x variation in completion time, 49x variation in tool efficiency, and 53x variation in estimated cost; (2) tool usage frequency shows no correlation with success (r = 0.077, p = 0.575) - one model used 917 tool calls while another solved the same task with 3 calls; (3) we identify two distinct inefficiency patterns: loop inefficiency and inference inefficiency; and (4) coding tasks achieve 100 percent success while research tasks present greater challenges (90.9 percent). We release all experimental data, verification scripts, and analysis code for full reproducibility.

152. 【2602.07055】Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

链接https://arxiv.org/abs/2602.07055

作者:Pingyue Zhang,Zihan Huang,Yue Wang,Jieyu Zhang,Letian Xue,Zihan Wang,Qineng Wang,Keshigeyan Chandrasegaran,Ruohan Zhang,Yejin Choi,Ranjay Krishna,Jiajun Wu,Li Fei-Fei,Manling Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:embodied intelligence requires, Spatial embodied intelligence, intelligence requires agents, embodied intelligence, intelligence requires

备注: Published at ICLR 2026

点击查看摘要

Abstract:Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.

153. 【2602.07038】UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

链接https://arxiv.org/abs/2602.07038

作者:Yifan Ji,Zhipeng Xu,Zhenghao Liu,Zulong Chen,Qian Zhang,Zhibo Yang,Junyang Lin,Yu Gu,Ge Yu,Maosong Sun

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:remains challenging due, Key Information Extraction, Large Multimodal Models, task-specific information requirements, real-world documents remains

备注

点击查看摘要

Abstract:Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All codes and datasets are available at this https URL.

154. 【2602.07036】MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs

链接https://arxiv.org/abs/2602.07036

作者:Zien Sheikh Ali,Hunzalah Hassan Bhatti,Rabindra Nath Nandi,Shammur Absar Chowdhury,Firoj Alam

类目:Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:large language models, Audio large language, instruction-aligned speech-text data, Modern Standard Arabic, language models

备注: Foundation Models, Large Language Models, Native, Speech Models, Arabic, AI-persona, Persona-conditioned-conversations

点击查看摘要

Abstract:Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.

155. 【2602.07032】LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation

链接https://arxiv.org/abs/2602.07032

作者:Yuheng Wu,Berk Gokmen,Zhouhua Xie,Peijing Li,Caroline Trippel,Priyanka Raina,Thierry Tambe

类目:Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL)

关键词:implement state-dependent behavior, hardware design, ability to understand, understand and implement, implement state-dependent

备注

点击查看摘要

Abstract:Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register transfer-level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
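
To make the idea of deriving reference behavior from a structured FSM description concrete, here is a small sketch. It is not the benchmark's pipeline: a Python dict stands in for the YAML description, the machine is a hypothetical Moore-style detector for two consecutive 1 bits, and `simulate` and `reference_vectors` are illustrative names. It only shows how expected outputs for a testbench could be generated from the same specification that defines the machine.

```python
from itertools import product

# Hypothetical specification: detect two consecutive 1 bits on a 1-bit input.
SPEC = {
    "reset": "IDLE",
    "transitions": {
        "IDLE": {0: "IDLE", 1: "SEEN1"},
        "SEEN1": {0: "IDLE", 1: "FOUND"},
        "FOUND": {0: "IDLE", 1: "FOUND"},
    },
    "outputs": {"IDLE": 0, "SEEN1": 0, "FOUND": 1},
}


def simulate(spec, inputs):
    """Return the output after each input bit, starting from the reset state."""
    state = spec["reset"]
    outputs = []
    for bit in inputs:
        state = spec["transitions"][state][bit]
        outputs.append(spec["outputs"][state])
    return outputs


def reference_vectors(spec, length):
    """Enumerate every input sequence of the given length with its expected outputs."""
    return [
        (list(bits), simulate(spec, bits))
        for bits in product((0, 1), repeat=length)
    ]
```

A candidate implementation could then be checked by driving it with each stimulus from `reference_vectors` and comparing its outputs against the expected ones.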

156. 【2602.06993】Attractor Patch Networks: Reducing Catastrophic Forgetting with Routed Low-Rank Patch Experts

链接https://arxiv.org/abs/2602.06993

作者:Shashank

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:typically updated end, position-wise feed-forward networks, updated end, APN, https URL

备注: 9 pages. Code (APN implementation in nanoGPT transformer): [this https URL](https://github.com/shankch/nanoGPT-apn) (baseline: [this https URL](https://github.com/karpathy/nanoGPT) ) Data prep: [this https URL](https://github.com/karpathy/nanoGPT/tree/master/data/shakespeare_char) and [this https URL](https://github.com/karpathy/nanoGPT/tree/master/data/shakespeare)

点击查看摘要

Abstract:Transformers achieve strong language modeling accuracy, yet their position-wise feed-forward networks (FFNs) are dense, globally shared, and typically updated end to end. These properties create two practical tensions. First, dense FFNs spend the same compute on every token regardless of context, and they allocate capacity uniformly even when language exhibits highly clustered context structure. Second, continual learning, in the sense of updating the model while serving a data stream, often produces interference because a small update touches broadly shared weights. We propose Attractor Patch Networks (APN), a plug-compatible replacement for the Transformer FFN. APN is a bank of patch experts. A similarity router selects a small top-k set of patches for each token by matching the token representation to learned prototypes. Each selected patch emits a low-rank residual update conditioned on a compact code. The architecture yields conditional, context-specialized nonlinear transformations while preserving the standard Transformer interface. This paper focuses on APN as an architectural primitive. We formalize APN, analyze its expressivity as a piecewise low-rank residual function class, and derive simple interference and stability arguments that make APN naturally compatible with continual learning. In experiments on character-level language modeling, APN achieves competitive perplexity (4.57 vs 4.32 PPL) while enabling dramatically better continual adaptation: when adapting to a shifted domain, APN achieves 2.6 times better retention (11.1 vs 29.4 PPL on the original domain) and 2.8 times better adaptation (6.4 vs 17.8 PPL on the new domain) compared to global fine-tuning of a dense FFN baseline.
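
The routing mechanism this abstract describes can be sketched in a few lines. This is a toy under stated assumptions, not the author's nanoGPT implementation linked above: dot-product similarity, a softmax over the selected scores, and a tanh code are choices made here for illustration, and `make_patch` and `apn_forward` are hypothetical names.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def make_patch(rng, d, r):
    """One patch expert: a routing prototype plus a rank-r down/up pair."""
    return {
        "prototype": [rng.gauss(0, 1) for _ in range(d)],
        "down": [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(r)],
        "up": [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d)],
    }


def apn_forward(x, patches, k=2):
    """Route x to its top-k patches and add their weighted low-rank updates."""
    scores = [dot(x, p["prototype"]) for p in patches]
    chosen = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in chosen)
    gates = [math.exp(scores[i] - top) for i in chosen]  # stable softmax numerators
    total = sum(gates)
    out = list(x)
    for gate, i in zip(gates, chosen):
        code = [math.tanh(c) for c in matvec(patches[i]["down"], x)]  # compact code
        update = matvec(patches[i]["up"], code)  # low-rank residual update
        out = [o + (gate / total) * u for o, u in zip(out, update)]
    return out, chosen
```

Because only the selected patches contribute, changing the weights of an unselected patch leaves this token's output untouched, which is the intuition behind the interference argument.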

157. 【2602.06976】Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks

链接https://arxiv.org/abs/2602.06976

作者:Chen Shen,Wei Cheng,Jingyue Yang,Huan Zhang,Yuhan Wu,Wei Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)

关键词:Large Language Models, extensive pre-training corpora, previously unfamiliar programming, proficiency of Large, Inference-time Language Acquisition

备注

点击查看摘要

Abstract:The proficiency of Large Language Models (LLMs) in coding tasks is often a reflection of their extensive pre-training corpora, which typically collapses when confronted with previously unfamiliar programming languages. Departing from data-intensive finetuning, we investigate the paradigm of Inference-time Language Acquisition (ILA), where an LLM masters an unfamiliar language through dynamic interaction with limited external resources. In this paper, we propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. By modeling essential human-like behaviors as a suite of tools, ILA-agent enables LLMs to incrementally explore, apply, and verify language knowledge through structured interactions with the official documentation and execution environment. To provide a rigorous evaluation in a low-resource setting, we construct Cangjie-bench, a multi-task benchmark based on the novel statically-typed language Cangjie. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks. Results using diverse LLMs demonstrate that ILA-agent significantly outperforms retrieval-augmented baselines. Further analysis of agent trajectories characterizes the emergent behavior patterns while highlighting persisting performance gaps.

158. 【2602.06975】BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents

链接https://arxiv.org/abs/2602.06975

作者:R. James Cotton,Thomas Leonard

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:making quantitative movement, movement analysis increasingly, quantitative movement analysis, resulting data remains, Markerless motion capture

备注

点击查看摘要

Abstract:Markerless motion capture is making quantitative movement analysis increasingly accessible, yet analyzing the resulting data remains a barrier for clinicians without programming expertise. We present BiomechAgent, a code-generating AI agent that enables biomechanical analysis through natural language, allowing users to query databases, generate visualizations, and even interpret data without writing code. To evaluate BiomechAgent's capabilities, we developed a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and demonstrated emerging clinical reasoning capabilities. We used our dataset to systematically evaluate several of our design decisions. Biomechanically-informed, domain-specific instructions significantly improved performance over generic prompts, and integrating validated specialized tools for gait event detection substantially boosted accuracy on challenging spatiotemporal analysis where the base agent struggled. We also tested BiomechAgent using a local open-weight model instead of a frontier cloud-based LLM and found that performance was substantially diminished in most domains other than database retrieval. In short, BiomechAgent makes the data from accessible motion capture much more useful and accessible to end users.

159. 【2602.06973】Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

链接https://arxiv.org/abs/2602.06973

作者:Lucky Susanto,Musa Izzanardi Wijanarko,Khumaisa Nur'aini,Farid Adilazuarda,Alham Fikri Aji,Derry Tanti Wijaya

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:improve autoregressive performance, sub-word tokenization bottleneck, language modeling aims, autoregressive performance, pixel-based language modeling

备注: Submitted to ARR January

点击查看摘要

Abstract:While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four Indonesian low-resource local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the same issue that pixel-based language modeling aims to resolve, which is the tokenizer misalignment problem. Despite having lower OOV and fertility rates, we show that the Llama 2 tokenizer performs significantly worse than a custom tokenizer, with improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
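
For readers unfamiliar with the two tokenizer statistics mentioned in this abstract, the sketch below shows one common way to compute them: fertility as the average number of tokens per word, and OOV rate as the share of tokens that fall back to an unknown symbol. The greedy longest-match tokenizer and the function names are hypothetical stand-ins, not the tokenizers or evaluation code used in the paper.

```python
def make_toy_tokenizer(vocab, unk="<unk>"):
    """Greedy longest-match tokenizer over a fixed vocabulary (toy stand-in)."""
    def tokenize(word):
        out, i = [], 0
        while i < len(word):
            # Try the longest remaining substring first.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    out.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry covers this character.
                out.append(unk)
                i += 1
        return out
    return tokenize


def fertility_and_oov(words, tokenize, unk="<unk>"):
    """Return (average tokens per word, share of tokens equal to unk)."""
    pieces = [tokenize(w) for w in words]
    total = sum(len(p) for p in pieces)
    unknown = sum(tok == unk for p in pieces for tok in p)
    return total / len(words), unknown / total
```

Both numbers describe how a tokenizer segments text, not how well a model trained on those segments performs, which is consistent with the abstract's finding that lower values did not translate into better results.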

160. 【2602.06967】Leveraging Adaptive Group Negotiation for Heterogeneous Multi-Robot Collaboration with Large Language Models

链接https://arxiv.org/abs/2602.06967

作者:Siqi Song,Xuanbing Xie,Zonglin Li,Yuqiang Li,Shijie Wang,Biqing Qi

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, environmental uncertainties, Language Models, long horizons, horizons under spatial

备注: 20 pages, 12 figures, Under Review

点击查看摘要

Abstract:Multi-robot collaboration tasks often require heterogeneous robots to work together over long horizons under spatial constraints and environmental uncertainties. Although Large Language Models (LLMs) excel at reasoning and planning, their potential for coordinated control has not been fully explored. Inspired by human teamwork, we present CLiMRS (Cooperative Large-Language-Model-Driven Heterogeneous Multi-Robot System), an adaptive group negotiation framework among LLMs for multi-robot collaboration. This framework pairs each robot with an LLM agent and dynamically forms subgroups through a general proposal planner. Within each subgroup, a subgroup manager leads perception-driven multi-LLM discussions to get commands for actions. Feedback is provided by both robot execution outcomes and environment changes. This grouping-planning-execution-feedback loop enables efficient planning and robust execution. To evaluate these capabilities, we introduce CLiMBench, a heterogeneous multi-robot benchmark of challenging assembly tasks. Our experiments show that CLiMRS surpasses the best baseline, achieving over 40% higher efficiency on complex tasks without sacrificing success on simpler ones. Overall, our results demonstrate that leveraging human-inspired group formation and negotiation principles significantly enhances the efficiency of heterogeneous multi-robot collaboration. Our code is available here: this https URL.

161. 【2602.06000】Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

链接https://arxiv.org/abs/2602.06000

作者:Ali Shendabadi,Parnia Izadirad,Mostafa Salehi,Mahmoud Bijankhan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

关键词:Speech Emotion Recognition, faced limitations due, sufficiently large datasets, Speech Emotion, Emotion Recognition

备注

点击查看摘要

Abstract:Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
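
As a rough illustration of what attention-based pooling does, the sketch below collapses a variable-length sequence of frame vectors into a single vector using softmax weights from a scoring vector. It is a generic single-head version written for this digest, not the paper's Multi-head Attentive Average Pooling or QKV Pooling, and `attentive_pool` and its parameter `w` are hypothetical names.

```python
import math


def attentive_pool(frames, w):
    """Pool a list of frame vectors into one vector.

    Each frame is scored by its dot product with w; the scores are turned
    into softmax weights, and the frames are averaged with those weights.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, frame)) for frame in frames]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(wt * frame[d] for wt, frame in zip(weights, frames)) for d in range(dim)]
```

With an all-zero scoring vector every frame gets the same weight and the result is plain mean pooling. A learned scoring vector lets the pooled representation emphasize the frames that matter most for the task.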

162. 【2602.08275】Linguistics and Human Brain: A Perspective of Computational Neuroscience

链接https://arxiv.org/abs/2602.08275

作者:Fudong Zhang,Bo Chai,Yujie Wu,Wai Ting Siok,Nizhuan Wang

类目:Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)

关键词:language-brain relationship requires, relationship requires bridging, Elucidating the language-brain, abstract theoretical frameworks, empirical neural data

备注

点击查看摘要

Abstract:Elucidating the language-brain relationship requires bridging the methodological gap between the abstract theoretical frameworks of linguistics and the empirical neural data of neuroscience. Serving as an interdisciplinary cornerstone, computational neuroscience formalizes the hierarchical and dynamic structures of language into testable neural models through modeling, simulation, and data analysis. This enables a computational dialogue between linguistic hypotheses and neural mechanisms. Recent advances in deep learning, particularly large language models (LLMs), have powerfully advanced this pursuit. Their high-dimensional representational spaces provide a novel scale for exploring the neural basis of linguistic processing, while the "model-brain alignment" framework offers a methodology to evaluate the biological plausibility of language-related theories.

163. 【2602.07547】Linguistic properties and model scale in brain encoding: from small to compressed language models

链接https://arxiv.org/abs/2602.07547

作者:Subba Reddy Oota,Vijay Rowtula,Satya Sai Srinath Namburi,Khushbu Pahwa,Anant Khandelwal,Manish Gupta,Tanmoy Chakraborty,Bapi S. Raju

类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Recent work, human brain activity, properties are responsible, work has shown, unclear what drives

备注: 40 pages, 33 figures

点击查看摘要

Abstract:Recent work has shown that scaling large language models (LLMs) improves their alignment with human brain activity, yet it remains unclear what drives these gains and which representational properties are responsible. Although larger models often yield better task performance and brain alignment, they are increasingly difficult to analyze mechanistically. This raises a fundamental question: what is the minimal model capacity required to capture brain-relevant representations? To address this question, we systematically investigate how constraining model scale and numerical precision affects brain alignment. We compare full-precision LLMs, small language models (SLMs), and compressed variants (quantized and pruned) by predicting fMRI responses during naturalistic language comprehension. Across model families up to 14B parameters, we find that 3B SLMs achieve brain predictivity indistinguishable from larger LLMs, whereas 1B models degrade substantially, particularly in semantic language regions. Brain alignment is remarkably robust to compression: most quantization and pruning methods preserve neural predictivity, with GPTQ as a consistent exception. Linguistic probing reveals a dissociation between task performance and brain predictivity: compression degrades discourse, syntax, and morphology, yet brain predictivity remains largely unchanged. Overall, brain alignment saturates at modest model scales and is resilient to compression, challenging common assumptions about neural scaling and motivating compact models for brain-aligned language modeling.

164. 【2602.07539】Training-Driven Representational Geometry Modularization Predicts Brain Alignment in Language Models

链接https://arxiv.org/abs/2602.07539

作者:Yixuan Liu,Zhiyuan Ma,Likai Tang,Runmin Gan,Xinche Zhang,Jinhao Li,Chao Xie,Sen Song

类目:Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)

关键词:cognitive science, neural representation, representation and computation, central question, question in cognitive

备注

点击查看摘要

Abstract:How large language models (LLMs) align with the neural representation and computation of human language is a central question in cognitive science. Using representational geometry as a mechanistic lens, we addressed this by tracking entropy, curvature, and fMRI encoding scores throughout Pythia (70M-1B) training. We identified a geometric modularization where layers self-organize into stable low- and high-complexity clusters. The low-complexity module, characterized by reduced entropy and curvature, consistently better predicted human language network activity. This alignment followed heterogeneous spatial-temporal trajectories: rapid and stable in temporal regions (AntTemp, PostTemp), but delayed and dynamic in frontal areas (IFG, IFGorb). Crucially, reduced curvature remained a robust predictor of model-brain alignment even after controlling for training progress, an effect that strengthened with model scale. These results link training-driven geometric reorganization to temporal-frontal functional specialization, suggesting that representational smoothing facilitates neural-like linguistic processing.

165. 【2602.07075】LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

链接https://arxiv.org/abs/2602.07075

作者:Xinwu Ye,Yicheng Mao,Jia Zhang,Yimeng Liu,Li Hao,Fang Wu,Zhiwei Li,Yuxuan Liao,Zehong Wang,Zhiyuan Liu,Zhenfei Yin,Li Yuan,Philip Torr,Huan Sun,Xiangxiang Zeng,Mengdi Wang,Le Cong,Shenghua Gao,Xiangru Tang

类目:Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:predominantly rely, rely on explicit, Chemical large language, perform complex reasoning, large language models

备注

点击查看摘要

Abstract:Chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) in natural language to perform complex reasoning. However, chemical reasoning is inherently continuous and structural, and forcing it into discrete linguistic tokens introduces a fundamental representation mismatch that constrains both efficiency and performance. We introduce LatentChem, a latent reasoning interface that decouples chemical computation from textual generation, enabling models to perform multi-step reasoning directly in continuous latent space while emitting language only for final outputs. Remarkably, we observe a consistent emergent behavior: when optimized solely for task success, models spontaneously internalize reasoning, progressively abandoning verbose textual derivations in favor of implicit latent computation. This shift is not merely stylistic but computationally advantageous. Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84× average inference speedup. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

166. 【2007.16012】BERT Learns (and Teaches) Chemistry

Link: https://arxiv.org/abs/2007.16012

Authors: Josh Payne,Mario Srouji,Dian Ang Yap,Vineet Kosaraju

Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: Modern computational organic, Modern computational, computational organic chemistry, computational organic, increasingly data-driven

Comments: 10 pages, 5 figures

Click to view abstract

Abstract:Modern computational organic chemistry is becoming increasingly data-driven. There remain a large number of important unsolved problems in this area such as product prediction given reactants, drug discovery, and metric-optimized molecule synthesis, but efforts to solve these problems using machine learning have also increased in recent years. In this work, we propose the use of attention to study functional groups and other property-impacting molecular substructures from a data-driven perspective, using a transformer-based model (BERT) on datasets of string representations of molecules and analyzing the behavior of its attention heads. We then apply the representations of functional groups and atoms learned by the model to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.

Information Retrieval

1. 【2602.08917】Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion

Link: https://arxiv.org/abs/2602.08917

Authors: Minghan Li,Ercong Nie,Siqi Zhao,Tongna Chen,Huiping Huang,Guodong Zhou

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: manually chosen exemplars, large language models, hand-crafted prompts, manually chosen, making it non-scalable

Comments

Click to view abstract

Abstract:Query expansion with large language models is promising but often relies on hand-crafted prompts, manually chosen exemplars, or a single LLM, making it non-scalable and sensitive to domain shift. We present an automated, domain-adaptive QE framework that builds in-domain exemplar pools by harvesting pseudo-relevant passages using a BM25-MonoT5 pipeline. A training-free cluster-based strategy selects diverse demonstrations, yielding strong and stable in-context QE without supervision. To further exploit model complementarity, we introduce a two-LLM ensemble in which two heterogeneous LLMs independently generate expansions and a refinement LLM consolidates them into one coherent expansion. Across TREC DL20, DBPedia, and SciFact, the refined ensemble delivers consistent and statistically significant gains over BM25, Rocchio, zero-shot, and fixed few-shot baselines. The framework offers a reproducible testbed for exemplar selection and multi-LLM generation, and a practical, label-free solution for real-world QE.

2. 【2602.08896】OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation

Link: https://arxiv.org/abs/2602.08896

Authors: Yehua Huang,Penglei Sun,Zebin Chen,Zhenheng Tang,Xiaowen Chu

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: peer review remains, Academic peer review, remains the cornerstone, field faces, faces some challenges

Comments

Click to view abstract

Abstract:Academic peer review remains the cornerstone of scholarly validation, yet the field faces some challenges in data and methods. From the data perspective, existing research is hindered by the scarcity of large-scale, verified benchmarks and oversimplified evaluation metrics that fail to reflect real-world editorial workflows. To bridge this gap, we present OmniReview, a comprehensive dataset constructed by integrating multi-source academic platforms encompassing comprehensive scholarly profiles through the disambiguation pipeline, yielding 202,756 verified review records. Based on this data, we introduce a three-tier hierarchical evaluation framework to assess recommendations from recall to precise expert identification. From the method perspective, existing embedding-based approaches suffer from the information bottleneck of semantic compression and limited interpretability. To resolve these method limitations, we propose Profiling Scholars with Multi-gate Mixture-of-Experts (Pro-MMoE), a novel framework that synergizes Large Language Models (LLMs) with Multi-task Learning. Specifically, it utilizes LLM-generated semantic profiles to preserve fine-grained expertise nuances and interpretability, while employing a Task-Adaptive MMoE architecture to dynamically balance conflicting evaluation goals. Comprehensive experiments demonstrate that Pro-MMoE achieves state-of-the-art performance across six of seven metrics, establishing a new benchmark for realistic reviewer recommendation.

3. 【2602.08886】Contrastive Learning for Diversity-Aware Product Recommendations in Retail

Link: https://arxiv.org/abs/2602.08886

Authors: Vasileios Karlis,Ezgi Yıldırım,David Vos,Maarten de Rijke

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: popular items dominates, Recommender systems, item catalog exposure, limited item catalog, items dominates recommendations

Comments

Click to view abstract

Abstract:Recommender systems often struggle with long-tail distributions and limited item catalog exposure, where a small subset of popular items dominates recommendations. This challenge is especially critical in large-scale online retail settings with extensive and diverse product assortments. This paper introduces an approach to enhance catalog coverage without compromising recommendation quality in the existing digital recommendation pipeline at IKEA Retail. Drawing inspiration from recent advances in negative sampling to address popularity bias, we integrate contrastive learning with carefully selected negative samples. Through offline and online evaluations, we demonstrate that our method improves catalog coverage, ensuring a more diverse set of recommendations yet preserving strong recommendation performance.

4. 【2602.08873】Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

Link: https://arxiv.org/abs/2602.08873

Authors: Lisette Espin-Noboa,Gonzalo Gabriel Mendez

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)

Keywords: Large language models, Large language, academic expert recommendation, Large, language models

Comments: 28 pages: 8 pages in main (5 figures, 1 table), 20 pages in appendix (18 figures, 2 tables). under-review

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for academic expert recommendation. Existing audits typically evaluate model outputs in isolation, largely ignoring end-user inference-time interventions. As a result, it remains unclear whether failures such as refusals, hallucinations, and uneven coverage stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures both technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that end-user interventions do not yield uniform improvements but instead redistribute error across dimensions. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, while RAG primarily improves technical quality while reducing diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing a general fix. We release code and data that can be adapted to other disciplines by replacing domain-specific ground truth and metrics.

5. 【2602.08872】Large Language Models for Geolocation Extraction in Humanitarian Crisis Response

Link: https://arxiv.org/abs/2602.08872

Authors: G. Cafferata,T. Demarco,K. Kalimeri,Y. Mejova,M.G. Beiró

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: crises demand timely, Humanitarian crises demand, effective response efforts, inform effective response, Large Language Models

Comments

Click to view abstract

Abstract:Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.

6. 【2602.08837】AMEM4Rec: Leveraging Cross-User Similarity for Memory Evolution in Agentic LLM Recommenders

Link: https://arxiv.org/abs/2602.08837

Authors: Minh-Duc Nguyen,Hai-Dang Kieu,Dung D. Le

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, shown strong potential, powered by Large, Language Models

Comments

Click to view abstract

Abstract:Agentic systems powered by Large Language Models (LLMs) have shown strong potential in recommender systems but remain hindered by several challenges. Fine-tuning LLMs is parameter-inefficient, and prompt-based agentic reasoning is limited by context length and hallucination risk. Moreover, existing agentic recommendation systems predominantly leverage semantic knowledge while neglecting the collaborative filtering (CF) signals essential for implicit preference modeling. To address these limitations, we propose AMEM4Rec, an agentic LLM-based recommender that learns collaborative signals in an end-to-end manner through cross-user memory evolution. AMEM4Rec stores abstract user behavior patterns from user histories in a global memory pool. Within this pool, memories are linked to similar existing ones and iteratively evolved to reinforce shared cross-user patterns, enabling the system to become aware of CF signals without relying on a pre-trained CF model. Extensive experiments on Amazon and MIND datasets show that AMEM4Rec consistently outperforms state-of-the-art LLM-based recommenders, demonstrating the effectiveness of evolving memory-guided collaborative filtering.

7. 【2602.08742】Welfarist Formulations for Diverse Similarity Search

Link: https://arxiv.org/abs/2602.08742

Authors: Siddharth Barman,Nirjhar Das,Shivam Gupta,Kirankumar Shiragur

Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: recommendation systems, retrieval-augmented generations, fundamental problem, problem in data, data structures

Comments

Click to view abstract

Abstract:Nearest Neighbor Search (NNS) is a fundamental problem in data structures with wide-ranging applications, such as web search, recommendation systems, and, more recently, retrieval-augmented generations (RAG). In such recent applications, in addition to the relevance (similarity) of the returned neighbors, diversity among the neighbors is a central requirement. In this paper, we develop principled welfare-based formulations in NNS for realizing diversity across attributes. Our formulations are based on welfare functions -- from mathematical economics -- that satisfy central diversity (fairness) and relevance (economic efficiency) axioms. With a particular focus on Nash social welfare, we note that our welfare-based formulations provide objective functions that adaptively balance relevance and diversity in a query-dependent manner. Notably, such a balance was not present in the prior constraint-based approach, which forced a fixed level of diversity and optimized for relevance. In addition, our formulation provides a parametric way to control the trade-off between relevance and diversity, providing practitioners with flexibility to tailor search results to task-specific requirements. We develop efficient nearest neighbor algorithms with provable guarantees for the welfare-based objectives. Notably, our algorithm can be applied on top of any standard ANN method (i.e., use standard ANN method as a subroutine) to efficiently find neighbors that approximately maximize our welfare-based objectives. Experimental results demonstrate that our approach is practical and substantially improves diversity while maintaining high relevance of the retrieved neighbors.
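To make the idea of a welfare-based objective concrete, here is a minimal sketch of greedy selection under a smoothed Nash-welfare score. It is not the paper's algorithm: the abstract does not give the exact objective, so the per-attribute utility (sum of similarities), the `log1p` smoothing, the greedy procedure, and the name `greedy_nash_select` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def greedy_nash_select(candidates, k):
    """Pick k items, greedily maximizing sum over attributes of log(1 + utility).

    candidates: list of (item_id, similarity, attribute) tuples, e.g. produced
    by any ANN method. The utility of an attribute is assumed to be the summed
    similarity of the chosen items carrying that attribute (an assumption).
    """
    utility = defaultdict(float)  # accumulated similarity per attribute
    chosen, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        def gain(candidate):
            _, sim, attr = candidate
            # Marginal increase of the log-welfare if this candidate is added.
            return math.log1p(utility[attr] + sim) - math.log1p(utility[attr])
        best = max(remaining, key=gain)
        remaining.remove(best)
        utility[best[2]] += best[1]
        chosen.append(best[0])
    return chosen

candidates = [("a", 0.9, "red"), ("b", 0.85, "red"), ("c", 0.6, "blue")]
print(greedy_nash_select(candidates, k=2))
```

Because the logarithm gives diminishing returns per attribute, the second pick here is the less similar "blue" item rather than a second "red" one, which is the query-dependent balance between relevance and diversity that the abstract describes.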

8. 【2602.08700】Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search

Link: https://arxiv.org/abs/2602.08700

Authors: Clemencia Siro,Zahra Abbasiantaeb,Yifei Yuan,Mohammad Aliannejadi,Maarten de Rijke

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: clarifying questions, increasingly employ clarifying, clarifying, questions, answering clarifying questions

Comments: Accepted at CHIIR 2025

Click to view abstract

Abstract:Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effect of multimodal and text-only clarifying questions in both tasks within a conversational search context from various perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across different expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for clarifying question answering, text-only setups demonstrated better user performance as they provided more comprehensive textual information in the absence of images. These results provide valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be strategically implemented based on the specific search context and user characteristics.

9. 【2602.08678】SA-CAISR: Stage-Adaptive and Conflict-Aware Incremental Sequential Recommendation

Link: https://arxiv.org/abs/2602.08678

Authors: Xiaomeng Song,Xinru Wang,Hanbing Wang,Hongyu Lu,Yu Chen,Zhaochun Ren,Zhumin Chen

Subjects: Information Retrieval (cs.IR)

Keywords: historical interaction sequences, aims to predict, interaction sequences, Incremental Sequential Recommendation, Sequential recommendation

Comments

Click to view abstract

Abstract:Sequential recommendation (SR) aims to predict a user's next action by learning from their historical interaction sequences. In real-world applications, these models require periodic updates to adapt to new interactions and evolving user preferences. While incremental learning methods facilitate these updates, they face significant challenges. Replay-based approaches incur high memory and computational costs, and regularization-based methods often struggle to discard outdated or conflicting knowledge. To overcome these challenges, we propose SA-CAISR, a Stage-Adaptive and Conflict-Aware Incremental Sequential Recommendation framework. As a buffer-free framework, SA-CAISR operates using only the old model and new data, directly addressing the high costs of replay-based techniques. SA-CAISR introduces a novel Fisher-weighted knowledge-screening mechanism that dynamically identifies outdated knowledge by estimating parameter-level conflicts between the old model and new data, allowing our approach to selectively remove obsolete knowledge while preserving compatible historical patterns. This dynamic balance between stability and adaptability allows our method to achieve a new state-of-the-art performance in incremental SR. Specifically, SA-CAISR improves Recall@20 by 2.0%, MRR@20 by 1.2%, and NDCG@20 by 1.4% on average across datasets, while reducing memory usage by 97.5% and training time by 46.9% compared to the best baselines. This efficiency allows real-world systems to rapidly update user profiles with minimal computational overhead, ensuring more timely and accurate recommendations.
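For readers unfamiliar with the metrics quoted above, the following sketch computes Recall@k, MRR@k and NDCG@k for a single user. It assumes a next-item protocol with exactly one held-out target per user, which the abstract does not state; the function name `metrics_at_k` is hypothetical.

```python
import math

def metrics_at_k(ranked_items, target, k=20):
    """Return (Recall@k, MRR@k, NDCG@k) for one user with a single target item."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    # With one relevant item the ideal DCG is 1, so NDCG reduces to 1/log2(rank + 1).
    return 1.0, 1.0 / rank, 1.0 / math.log2(rank + 1)

print(metrics_at_k(["x", "y", "z"], target="y", k=2))
```

Dataset-level scores are then the averages of these per-user values.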

10. 【2602.08668】Retrieval Pivot Attacks in Hybrid RAG: Measuring and Mitigating Amplified Leakage from Vector Seeds to Graph Expansion

Link: https://arxiv.org/abs/2602.08668

Authors: Scott Thornton

Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Hybrid Retrieval-Augmented Generation, combine vector similarity, vector similarity search, Retrieval-Augmented Generation, pipelines combine vector

Comments: 18 pages, 5 figures

Click to view abstract

Abstract:Hybrid Retrieval-Augmented Generation (RAG) pipelines combine vector similarity search with knowledge graph expansion for multi-hop reasoning. We show that this composition introduces a distinct security failure mode: a vector-retrieved "seed" chunk can pivot via entity links into sensitive graph neighborhoods, causing cross-tenant data leakage that does not occur in vector-only retrieval. We formalize this risk as Retrieval Pivot Risk (RPR) and introduce companion metrics Leakage@k, Amplification Factor, and Pivot Depth (PD) to quantify leakage magnitude and traversal structure. We present seven Retrieval Pivot Attacks that exploit the vector-to-graph boundary and show that adversarial injection is not required: naturally shared entities create cross-tenant pivot paths organically. Across a synthetic multi-tenant enterprise corpus and the Enron email corpus, the undefended hybrid pipeline exhibits high pivot risk (RPR up to 0.95) with multiple unauthorized items returned per query. Leakage consistently appears at PD=2, which we attribute to the bipartite chunk-entity topology and formalize as a proposition. We then show that enforcing authorization at a single location, the graph expansion boundary, eliminates measured leakage (RPR near 0) across both corpora, all attack variants, and label forgery rates up to 10 percent, with minimal overhead. Our results indicate the root cause is boundary enforcement, not inherently complex defenses: two individually secure retrieval components can compose into an insecure system unless authorization is re-checked at the transition point.
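The mitigation described above, re-checking authorization at the graph expansion boundary, can be sketched roughly as follows. This is an illustration under assumptions rather than the paper's implementation: the graph is a plain adjacency dict, shared entity nodes are marked with owner `None`, and the names `expand_with_authz`, `edges` and `owner` are hypothetical.

```python
from collections import deque

def expand_with_authz(seeds, edges, owner, tenant, max_depth=2):
    """Breadth-first graph expansion that re-checks tenant ownership on every hop.

    seeds:  node ids returned by vector search
    edges:  dict mapping node id -> list of neighbor node ids
    owner:  dict mapping node id -> tenant id, or None for shared entity nodes
    """
    def allowed(node):
        # Deny unknown nodes; allow shared entities and the caller's own chunks.
        return node in owner and owner[node] in (None, tenant)

    visited = {s for s in seeds if allowed(s)}
    queue = deque((s, 0) for s in visited)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in edges.get(node, []):
            if nxt not in visited and allowed(nxt):
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

owner = {"chunk_a1": "A", "chunk_a2": "A", "chunk_b1": "B", "entity_x": None}
edges = {"chunk_a1": ["entity_x"], "entity_x": ["chunk_b1", "chunk_a2"]}
print(sorted(expand_with_authz(["chunk_a1"], edges, owner, tenant="A")))
```

In this toy graph the other tenant's chunk sits two hops from the seed (chunk, shared entity, chunk), which mirrors the pivot depth of 2 reported in the abstract; the per-hop check is what keeps it out of the result.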

11. 【2602.08667】SRSUPM: Sequential Recommender System Based on User Psychological Motivation

Link: https://arxiv.org/abs/2602.08667

Authors: Yicheng Di,Yuan Liu,Zhi Chen,Jingcai Guo

Subjects: Information Retrieval (cs.IR)

Keywords: psychological motivation shift, psychological motivation, motivation shift, Psychological Motivation Shift-driven, motivation

Comments: 9 pages, 8 pages

Click to view abstract

Abstract:Sequential recommender infers users' evolving psychological motivations from historical interactions to recommend the next preferred items. Most existing methods compress recent behaviors into a single vector and optimize it toward a single observed target item, but lack explicit modeling of psychological motivation shift. As a result, they struggle to uncover the distributional patterns across different shift degrees and to capture collaborative knowledge that is sensitive to psychological motivation shift. We propose a general framework, the Sequential Recommender System Based on User Psychological Motivation, to enhance sequential recommenders with psychological motivation shift-aware user modeling. Specifically, the Psychological Motivation Shift Assessment quantitatively measures psychological motivation shift; guided by PMSA, the Shift Information Construction models dynamically evolving multi-level shift states, and the Psychological Motivation Shift-driven Information Decomposition decomposes and regularizes representations across shift levels. Moreover, the Psychological Motivation Shift Information Matching strengthens collaborative patterns related to psychological motivation shift to learn more discriminative user representations. Extensive experiments on three public benchmarks show that SRSUPM consistently outperforms representative baselines on diverse sequential recommender tasks.

12. 【2602.08612】OneLive: Dynamically Unified Generative Framework for Live-Streaming Recommendation

Link: https://arxiv.org/abs/2602.08612

Authors: Shen Wang,Yusheng Huang,Ruochen Yang,Shuang Wen,Pengbo Xu,Jiangxia Cao,Yueyang Liu,Kuo Cai,Chengcheng Guo,Shiyao Wang,Xinchen Luo,Qiang Luo,Ruiming Tang,Shuang Yang,Zhaojie Liu,Guorui Zhou,Han Li,Kun Gai

Subjects: Information Retrieval (cs.IR)

Keywords: recommender system serves, users and authors, Live-streaming recommender system, serves as critical, critical infrastructure

Comments: Work in progress

Click to view abstract

Abstract:Live-streaming recommender system serves as critical infrastructure that bridges the patterns of real-time interactions between users and authors. Similar to traditional industrial recommender systems, live-streaming recommendation also relies on cascade architectures to support large-scale concurrency. Recent advances in generative recommendation unify the multi-stage recommendation process with Transformer-based architectures, offering improved scalability and higher computational efficiency. However, the inherent complexity of live-streaming prevents the direct transfer of these methods to live-streaming scenario, where continuously evolving content, limited lifecycles, strict real-time constraints, and heterogeneous multi-objectives introduce unique challenges that invalidate static tokenization and conventional model framework. To address these issues, we propose OneLive, a dynamically unified generative recommendation framework tailored for live-streaming scenario. OneLive integrates four key components: (i) A Dynamic Tokenizer that continuously encodes evolving real-time live content fused with behavior signal through residual quantization; (ii) A Time-Aware Gated Attention mechanism that explicitly models temporal dynamics for timely decision making; (iii) An efficient decoder-only generative architecture enhanced with Sequential MTP and QK Norm for stable training and accelerated inference; (iv) A Unified Multi-Objective Alignment Framework reinforces policy optimization for personalized preferences.

13. 【2602.08575】RankGR: Rank-Enhanced Generative Retrieval with Listwise Direct Preference Optimization in Recommendation

Link: https://arxiv.org/abs/2602.08575

Authors: Kairui Fu,Changfa Wu,Kun Yuan,Binbin Cao,Dunxian Huang,Yuliang Yan,Junjun Zheng,Jianning Zhang,Silu Zhou,Jian Wu,Kun Kuang

Subjects: Information Retrieval (cs.IR)

Keywords: autoregressively decoding identifiers, Rank-enhanced Generative Retrieval, promising paradigm, autoregressively decoding, Generative retrieval

Comments

Click to view abstract

Abstract:Generative retrieval (GR) has emerged as a promising paradigm in recommendation systems by autoregressively decoding identifiers of target items. Despite its potential, current approaches typically rely on the next-token prediction schema, which treats each token of the next interacted items as the sole target. This narrow focus 1) limits their ability to capture the nuanced structure of user preferences, and 2) overlooks the deep interaction between decoded identifiers and user behavior sequences. In response to these challenges, we propose RankGR, a Rank-enhanced Generative Retrieval method that incorporates listwise direct preference optimization for recommendation. RankGR decomposes the retrieval process into two complementary stages: the Initial Assessment Phase (IAP) and the Refined Scoring Phase (RSP). In IAP, we incorporate a novel listwise direct preference optimization strategy into GR, thus facilitating a more comprehensive understanding of the hierarchical user preferences and more effective partial-order modeling. The RSP then refines the top-{\lambda} candidates generated by IAP with interactions towards input sequences using a lightweight scoring module, leading to more precise candidate evaluation. Both phases are jointly optimized under a unified GR model, ensuring consistency and efficiency. Additionally, we implement several practical improvements in training and deployment, ultimately achieving a real-time system capable of handling nearly ten thousand requests per second. Extensive offline performance on both research and industrial datasets, as well as the online gains on the "Guess You Like" section of Taobao, validate the effectiveness and scalability of RankGR.

14. 【2602.08569】Towards Reliable Social A/B Testing: Spillover-Contained Clustering with Robust Post-Experiment Analysis

Link: https://arxiv.org/abs/2602.08569

Authors: Xu Min,Zhaoxu Yang,Kaixuan Tan,Juan Yan,Xunbin Xiong,Zihao Zhu,Kaiyu Zhu,Fenglin Cui,Yang Yang,Sihua Yang,Jianhui Bu

Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)

Keywords: control group, foundation of decision-making, decision-making in online, products often suffer, treatment effects

Comments

Click to view abstract

Abstract:A/B testing is the foundation of decision-making in online platforms, yet social products often suffer from network interference: user interactions cause treatment effects to spill over into the control group. Such spillovers bias causal estimates and undermine experimental conclusions. Existing approaches face key limitations: user-level randomization ignores network structure, while cluster-based methods often rely on general-purpose clustering that is not tailored for spillover containment and has difficulty balancing unbiasedness and statistical power at scale. We propose a spillover-contained experimentation framework with two stages. In the pre-experiment stage, we build social interaction graphs and introduce a Balanced Louvain algorithm that produces stable, size-balanced clusters while minimizing cross-cluster edges, enabling reliable cluster-based randomization. In the post-experiment stage, we develop a tailored CUPAC estimator that leverages pre-experiment behavioral covariates to reduce the variance induced by cluster-level assignment, thereby improving statistical power. Together, these components provide both structural spillover containment and robust statistical inference. We validate our approach through large-scale social sharing experiments on Kuaishou, a platform serving hundreds of millions of users. Results show that our method substantially reduces spillover and yields more accurate assessments of social strategies than traditional user-level designs, establishing a reliable and scalable framework for networked A/B testing.
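The variance-reduction step can be illustrated with the textbook covariate adjustment behind CUPED and CUPAC, where a pre-experiment prediction of the outcome is regressed out. The paper's tailored estimator is not specified in the abstract, so this sketch shows only the generic form; the name `covariate_adjust` and the toy numbers are made up for illustration.

```python
from statistics import mean, pvariance

def covariate_adjust(outcomes, predictions):
    """Adjust per-cluster outcomes using pre-experiment predictions.

    adjusted_i = y_i - theta * (p_i - mean(p)), with theta = cov(y, p) / var(p).
    The adjustment leaves the mean unchanged and lowers the variance whenever
    the predictions are correlated with the outcomes.
    """
    y_bar, p_bar = mean(outcomes), mean(predictions)
    cov = sum((y - y_bar) * (p - p_bar) for y, p in zip(outcomes, predictions))
    var = sum((p - p_bar) ** 2 for p in predictions)
    theta = cov / var if var > 0 else 0.0
    return [y - theta * (p - p_bar) for y, p in zip(outcomes, predictions)]

outcomes = [1.0, 2.0, 3.0, 4.0]       # hypothetical per-cluster metric values
predictions = [1.1, 1.9, 3.2, 3.8]    # hypothetical pre-experiment predictions
adjusted = covariate_adjust(outcomes, predictions)
print(pvariance(outcomes), pvariance(adjusted))
```

In a real experiment the predictions must be built only from pre-treatment data, otherwise the adjustment can bias the treatment effect estimate.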

15. 【2602.08559】QARM V2: Quantitative Alignment Multi-Modal Recommendation for Reasoning User Sequence Modeling

Link: https://arxiv.org/abs/2602.08559

Authors: Tian Xia,Jiaqi Zhang,Yueyang Liu,Hongjian Dou,Tingya Yin,Jiangxia Cao,Xulei Liang,Tianlu Xie,Lihao Liu,Xiang Chen,Shen Wang,Changxin Lao,Haixiang Gan,Jinkai Yu,Keting Cen,Lu Hao,Xu Zhang,Qiqiang Zhong,Zhongbo Sun,Yiyu Wang,Shuang Yang,Mingxin Wen,Xiangyu Wu,Shaoguo Liu,Tingting Gao,Zhaojie Liu,Han Li,Kun Gai

Subjects: Information Retrieval (cs.IR)

Keywords: large language models, industrial recommendation systems, General Search Unit, Exact Search Unit, enhance industrial recommendation

Comments: Work in progress

Click to view abstract

Abstract:With the evolution of large language models (LLMs), there is growing interest in leveraging their rich semantic understanding to enhance industrial recommendation systems (RecSys). Traditional RecSys relies on ID-based embeddings for user sequence modeling in the General Search Unit (GSU) and Exact Search Unit (ESU) paradigm, which suffers from low information density, knowledge isolation, and weak generalization ability. While LLMs offer complementary strengths with dense semantic representations and strong generalization, directly applying LLM embeddings to RecSys faces critical challenges: representation unmatch with business objectives and representation unlearning end-to-end with downstream tasks. In this paper, we present QARM V2, a unified framework that bridges LLM semantic understanding with RecSys business requirements for user sequence modeling.

16. 【2602.08545】DA-RAG: Dynamic Attributed Community Search for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2602.08545

Authors: Xingyuan Zeng,Zuohan Wu,Yue Wang,Chen Zhang,Quanming Yao,Libin Zheng,Jian Yin

Subjects: Information Retrieval (cs.IR)

Keywords: large language models, unprecedented comprehension capabilities, web search engines, comprehension capabilities, large language

Comments

Click to view abstract

Abstract:Owing to their unprecedented comprehension capabilities, large language models (LLMs) have become indispensable components of modern web search engines. From a technical perspective, this integration represents retrieval-augmented generation (RAG), which enhances LLMs by grounding them in external knowledge bases. A prevalent technical approach in this context is graph-based RAG (G-RAG). However, current G-RAG methodologies frequently underutilize graph topology, predominantly focusing on low-order structures or pre-computed static communities. This limitation affects their effectiveness in addressing dynamic and complex queries. Thus, we propose DA-RAG, which leverages attributed community search (ACS) to extract relevant subgraphs based on the queried question dynamically. DA-RAG captures high-order graph structures, allowing for the retrieval of self-complementary knowledge. Furthermore, DA-RAG is equipped with a chunk-layer oriented graph index, which facilitates efficient multi-granularity retrieval while significantly reducing both computational and economic costs. We evaluate DA-RAG on multiple datasets, demonstrating that it outperforms existing RAG methods by up to 40% in head-to-head comparisons across four metrics while reducing index construction time and token overhead by up to 37% and 41%, respectively.

17. 【2602.08543】GISA: A Benchmark for General Information-Seeking Assistant

Link: https://arxiv.org/abs/2602.08543

Authors: Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: multi-turn web interactions, large language models, web interactions, advancement of large, large language

Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
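
To illustrate why structured answer formats allow deterministic scoring, here is a minimal sketch of exact-match checking over the four formats named in the abstract. The function names, the normalization step, and the per-format comparison rules (for example, treating table row order as irrelevant) are assumptions made for illustration and may differ from the benchmark's actual scorer.

```python
def normalize(value) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(str(value).lower().split())


def exact_match(fmt: str, predicted, gold) -> bool:
    """Return True when `predicted` equals `gold` under the rule assumed for `fmt`."""
    if fmt == "item":
        # A single value: compare after normalization.
        return normalize(predicted) == normalize(gold)
    if fmt == "set":
        # Unordered collection: order and duplicates are ignored.
        return {normalize(x) for x in predicted} == {normalize(x) for x in gold}
    if fmt == "list":
        # Ordered collection: position matters.
        return [normalize(x) for x in predicted] == [normalize(x) for x in gold]
    if fmt == "table":
        # Rows of cells: cell order within a row matters, row order does not (assumption).
        def rows(table):
            return sorted(tuple(normalize(cell) for cell in row) for row in table)
        return rows(predicted) == rows(gold)
    raise ValueError(f"unknown answer format: {fmt}")
```

Because every branch reduces to an equality test on normalized values, two runs over the same prediction always yield the same score, with no judge model involved.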

18. 【2602.08530】PIT: A Dynamic Personalized Item Tokenizer for End-to-End Generative Recommendation

Link: https://arxiv.org/abs/2602.08530

Authors: Huanjie Wang, Xinchen Luo, Honghui Bao, Zhang Zixing, Lejian Ren, Yunfan Wu, Hongwei Zhang, Liwei Guan, Guang Chen

Categories: Information Retrieval (cs.IR)

Keywords: sequence generation task, revolutionized recommender systems, discrete item identifiers, systems by reformulating, reformulating retrieval

Abstract: Generative Recommendation has revolutionized recommender systems by reformulating retrieval as a sequence generation task over discrete item identifiers. Despite this progress, existing approaches typically rely on static, decoupled tokenization that ignores collaborative signals. While recent methods attempt to integrate collaborative signals into item identifiers either during index construction or through end-to-end modeling, they encounter significant challenges in real-world production environments. Specifically, the volatility of collaborative signals leads to unstable tokenization, and current end-to-end strategies often devolve into suboptimal two-stage training rather than achieving true co-evolution. To bridge this gap, we propose PIT, a dynamic Personalized Item Tokenizer framework for end-to-end generative recommendation, which employs a co-generative architecture that harmonizes collaborative patterns through collaborative signal alignment and synchronizes the item tokenizer with the generative recommender via co-evolution learning. This enables the dynamic, joint, end-to-end evolution of both index construction and recommendation. Furthermore, a one-to-many beam index ensures scalability and robustness, facilitating seamless integration into large-scale industrial deployments. Extensive experiments on real-world datasets demonstrate that PIT consistently outperforms competitive baselines. In a large-scale deployment at Kuaishou, an online A/B test yielded a substantial 0.402% uplift in App Stay Time, validating the framework's effectiveness in dynamic industrial environments.

19. 【2602.08457】Hybrid Pooling with LLMs via Relevance Context Learning

Link: https://arxiv.org/abs/2602.08457

Authors: David Otero, Javier Parapar

Categories: Information Retrieval (cs.IR)

Keywords: evaluating Information Retrieval, Information Retrieval, manual annotation remains, annotation remains costly, evaluating Information

Abstract: High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or In-Context Learning (ICL) with a small number of labeled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalize to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labeled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyze sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-k pool from participating systems is judged by human assessors, while the remaining documents are labeled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.
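
The hybrid pooling split described in the abstract can be sketched as follows for a single topic. This is an illustrative reading, not the authors' code: `hybrid_pool` and its arguments are hypothetical names, and "the remaining documents" is assumed to mean documents retrieved below depth k by at least one run.

```python
def hybrid_pool(runs, k):
    """Split one topic's retrieved documents into two judging queues.

    runs: dict mapping a system name to its ranked list of document ids.
    Returns (human_pool, llm_pool): the union of every run's top-k goes to
    human assessors; documents seen only below depth k go to the LLM assessor.
    """
    human_pool, seen = [], set()
    for ranking in runs.values():
        for doc in ranking[:k]:          # shallow depth-k pool
            if doc not in seen:
                seen.add(doc)
                human_pool.append(doc)

    llm_pool = []
    for ranking in runs.values():
        for doc in ranking[k:]:          # everything below depth k
            if doc not in seen:
                seen.add(doc)
                llm_pool.append(doc)
    return human_pool, llm_pool
```

In a setup like RCL, the human-judged pool would double as the source of judged query-document pairs from which the relevance narratives are generated.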

20. 【2602.08411】A Sketch+Text Composed Image Retrieval Dataset for Thangka

Link: https://arxiv.org/abs/2602.08411

Authors: Jinyu Xu, Yi Sun, Jiangling Zhang, Qing Xie, Daomin Ji, Zhifeng Bao, Jiachen Li, Yanchun Ma, Yongjian Liu

Categories: Information Retrieval (cs.IR)

Keywords: multiple query modalities, combining multiple query, Composed Image Retrieval, enables image retrieval, short textual modifications

Comments: 9 pages

Abstract:Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at this https URL.

21. 【2602.08254】SynthAgent: A Multi-Agent LLM Framework for Realistic Patient Simulation -- A Case Study in Obesity with Mental Health Comorbidities

Link: https://arxiv.org/abs/2602.08254

Authors: Arman Aghaee, Sepehr Asgarian, Jouhyun Jeon

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: Simulating high-fidelity patients, privacy-restricted real-world data, high-fidelity patients offers, studying complex diseases, Simulating high-fidelity

Comments: Presented at the Health Intelligence workshop at AAAI 2026, Singapore

Abstract:Simulating high-fidelity patients offers a powerful avenue for studying complex diseases while addressing the challenges of fragmented, biased, and privacy-restricted real-world data. In this study, we introduce SynthAgent, a novel Multi-Agent System (MAS) framework designed to model obesity patients with comorbid mental disorders, including depression, anxiety, social phobia, and binge eating disorder. SynthAgent integrates clinical and medical evidence from claims data, population surveys, and patient-centered literature to construct personalized virtual patients enriched with personality traits that influence adherence, emotion regulation, and lifestyle behaviors. Through autonomous agent interactions, the system simulates disease progression, treatment response, and life management across diverse psychosocial contexts. Evaluation of more than 100 generated patients demonstrated that GPT-5 and Claude 4.5 Sonnet achieved the highest fidelity as the core engine in the proposed MAS framework, outperforming Gemini 2.5 Pro and DeepSeek-R1. SynthAgent thus provides a scalable and privacy-preserving framework for exploring patient journeys, behavioral dynamics, and decision-making processes in both medical and psychological domains.

22. 【2602.08097】Prune, Don't Rebuild: Efficiently Tuning $\alpha$-Reachable Graphs for Nearest Neighbor Search

Link: https://arxiv.org/abs/2602.08097

Authors: Tian Zhang, Ashwin Padaki, Jiaming Liang, Zack Ives, Erik Waingarten

Categories: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

Keywords: Vector similarity search, alpha, essential primitive, primitive in modern, Vector similarity

Abstract:Vector similarity search is an essential primitive in modern AI and ML applications. Most vector databases adopt graph-based approximate nearest neighbor (ANN) search algorithms, such as DiskANN (Subramanya et al., 2019), which have demonstrated state-of-the-art empirical performance. DiskANN's graph construction is governed by a reachability parameter $\alpha$, which gives a trade-off between construction time, query time, and accuracy. However, adaptively tuning this trade-off typically requires rebuilding the index for different $\alpha$ values, which is prohibitive at scale. In this work, we propose RP-Tuning, an efficient post-hoc routine, based on DiskANN's pruning step, to adjust the $\alpha$ parameter without reconstructing the full index. Within the $\alpha$-reachability framework of prior theoretical works (Indyk and Xu, 2023; Gollapudi et al., 2025), we prove that pruning an initially $\alpha$-reachable graph with RP-Tuning preserves worst-case reachability guarantees in general metrics and improved guarantees in Euclidean metrics. Empirically, we show that RP-Tuning accelerates DiskANN tuning on four public datasets by up to $43\times$ with negligible overhead.
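
For readers unfamiliar with how $\alpha$ enters the graph, the sketch below shows the pruning rule from the original DiskANN paper (often called RobustPrune), on which the abstract says RP-Tuning is based. It is not the RP-Tuning routine itself. The function name, argument names, and the use of Euclidean distance over a list of coordinate tuples are assumptions for illustration.

```python
import math


def robust_prune(p, candidates, alpha, max_degree, points):
    """Select out-neighbors for node `p` using DiskANN-style alpha pruning.

    Repeatedly keep the candidate closest to p, then drop every remaining
    candidate c for which alpha * d(kept, c) <= d(p, c), i.e. candidates
    that the kept neighbor already "covers".
    """
    def dist(a, b):
        return math.dist(points[a], points[b])

    # Candidates ordered by distance to p; filtering below preserves this order.
    pool = sorted((c for c in set(candidates) if c != p), key=lambda c: dist(p, c))
    kept = []
    while pool and len(kept) < max_degree:
        star = pool.pop(0)
        kept.append(star)
        pool = [c for c in pool if alpha * dist(star, c) > dist(p, c)]
    return kept
```

With `points = [(0, 0), (1, 0), (2, 0), (0, 3)]`, pruning node 0 at `alpha=1.0` discards node 2 because node 1 already covers it, while a larger `alpha` keeps it. A larger $\alpha$ therefore yields a denser graph, which is the trade-off the abstract refers to.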

23. 【2602.08070】IRB: Automated Generation of Robust Factuality Benchmarks

Link: https://arxiv.org/abs/2602.08070

Authors: Lam Thanh Do, Bhagyashree Taleka, Hozaifa Ammar Bhutta, Vikram Sharma Mailthody, Kevin Chen-Chuan Chang, Wen-mei Hwu

Categories: Information Retrieval (cs.IR)

Keywords: Static benchmarks, require significant manual, significant manual effort, RAG systems, maintain robustness

Comments: Code: https://github.com/Hozaifa-Bhutta/IRB

Abstract: Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline utilizing a factual scaffold and an algorithmic scaffold. We utilize IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.

24. 【2602.07987】Learning to Alleviate Familiarity Bias in Video Recommendation

Link: https://arxiv.org/abs/2602.07987

Authors: Zheng Ren, Yi Wu, Jianan Lu, Acar Ary, Yiqu Liu, Li Wei, Lukasz Heldt

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Modern video recommendation, Modern video, optimize user engagement, face structural exposure, structural exposure imbalances

Comments: Accepted to the Companion Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, UAE

Abstract:Modern video recommendation systems aim to optimize user engagement and platform objectives, yet often face structural exposure imbalances caused by behavioral biases. In this work, we focus on the post-ranking stage and present LAFB (Learning to Alleviate Familiarity Bias), a lightweight and model-agnostic framework designed to mitigate familiarity bias in recommendation outputs. LAFB models user-content familiarity using discrete and continuous interaction features, and estimates personalized debiasing factors to adjust user rating prediction scores, thereby reducing the dominance of familiar content in the final ranking. We conduct large-scale offline evaluations and online A/B testing in a real-world recommendation system, under a unified serving stack that also compares LAFB with deployable popularity-oriented remedies. Results show that LAFB increases novel watch-time share and improves exposure for emerging creators and overall content diversity, while maintaining stable overall watch time and short-term satisfaction. LAFB has already been launched in the post-ranking stage of YouTube's recommendation system, demonstrating its effectiveness in real-world applications.

25. 【2602.07847】SimGR: Escaping the Pitfalls of Generative Decoding in LLM-based Recommendation

Link: https://arxiv.org/abs/2602.07847

Authors: Yuanbo Zhao, Ruochen Liu, Senzhang Wang, Jun Yin, Yuxin Dong, Huan Gong, Hao Chen, Shirui Pan, Chengqi Zhang

Categories: Information Retrieval (cs.IR)

Keywords: enable personalized recommendations, enable personalized, accurately model, personalized recommendations

Abstract: A core objective in recommender systems is to accurately model the distribution of user preferences over items to enable personalized recommendations. Recently, driven by the strong generative capabilities of large language models (LLMs), LLM-based generative recommendation has become increasingly popular. However, we observe that existing methods inevitably introduce systematic bias when estimating item-level preference distributions. Specifically, autoregressive generation suffers from incomplete coverage due to beam search pruning, while parallel generation distorts probabilities by assuming token independence. We attribute this issue to a fundamental modeling mismatch: these methods approximate item-level distributions via token-level generation, which inherently induces approximation errors. Through both theoretical analysis and empirical validation, we demonstrate that token-level generation cannot faithfully substitute item-level generation, leading to biased item distributions. To address this, we propose Simply Generative Recommendation (SimGR), a framework that directly models item-level preference distributions in a shared latent space and ranks items by similarity, thereby aligning the modeling objective with recommendation and mitigating distributional distortion. Extensive experiments across multiple datasets and LLM backbones show that SimGR consistently outperforms existing generative recommenders. Our code is available at this https URL
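
As a rough illustration of what "modeling item-level distributions in a shared latent space and ranking by similarity" can look like, the sketch below scores every item against a user vector and normalizes the scores with a softmax. The cosine similarity, the temperature parameter, and the function name are assumptions made here; the abstract does not specify SimGR's actual scoring function.

```python
import math


def item_distribution(user_vec, item_vecs, temperature=1.0):
    """Return a probability for every item from user-item cosine similarity.

    user_vec: sequence of floats in the shared latent space.
    item_vecs: dict mapping item id to a vector of the same dimension.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scores = {item: cosine(user_vec, vec) / temperature
              for item, vec in item_vecs.items()}
    top = max(scores.values())                      # subtract max for stability
    exps = {item: math.exp(s - top) for item, s in scores.items()}
    total = sum(exps.values())
    return {item: e / total for item, e in exps.items()}
```

Because the softmax runs over the full item set, every item receives a probability and the distribution sums to one. Token-level decoding lacks this property when beam search prunes candidates, which is the coverage problem the abstract describes.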

26. 【2602.07840】SAGE: Scalable AI Governance Evaluation

Link: https://arxiv.org/abs/2602.07840

Authors: Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: large-scale search systems, Evaluating relevance, fundamentally constrained, high-throughput requirements, resource-constrained human oversight

Abstract: Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92$\times$ lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.

27. 【2602.07774】Generative Reasoning Re-ranker

Link: https://arxiv.org/abs/2602.07774

Authors: Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Hamed Firooz, Wenlin Chen, Luke Simon

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, explore Large Language, Recent studies increasingly, increasingly explore Large, Language Models

Comments: 31 pages

Abstract:Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
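
The reported gains are stated in Recall@5 and NDCG@5. For reference, here is a minimal sketch of both metrics under the common binary-relevance definition. The paper may use a different variant (for example, a single ground-truth item per user), so treat the exact formulas and function names as assumptions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Recall@k ignores where in the top-k a hit lands, whereas NDCG@k discounts hits logarithmically by rank, so a reranker that only reorders items within the top-k can change NDCG@k without changing Recall@k.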

28. 【2602.07773】SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents

Link: https://arxiv.org/abs/2602.07773

Authors: Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang, Wei Han, Yong Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: complex question answering, Recent deep search, Recent deep, excel at complex, iteratively planning

Abstract:Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.

29. 【2602.07739】HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation

Link: https://arxiv.org/abs/2602.07739

Authors: Hiren Madhu, Ngoc Bui, Ali Maatouk, Leandros Tassiulas, Smita Krishnaswamy, Menglin Yang, Sukanta Ganguly, Kiran Srinivasan, Rex Ying

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: remain largely confined, Embedding geometry plays, Euclidean embeddings, Euclidean embeddings fail, retrieval-augmented generation

Abstract:Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
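
For context on the geometry involved, the sketch below computes geodesic distance in the Lorentz (hyperboloid) model and lifts a Euclidean vector onto the hyperboloid with the exponential map at the origin. It assumes curvature -1. Using the exponential map as the projection is a common choice assumed here, not a detail confirmed by the abstract, and the function names are hypothetical.

```python
import math


def lift_to_lorentz(v):
    """Map a Euclidean vector onto the hyperboloid via the exponential map at the origin."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return (1.0,) + tuple(0.0 for _ in v)
    scale = math.sinh(norm) / norm
    return (math.cosh(norm),) + tuple(scale * x for x in v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time component plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # Clamp to 1.0 because rounding can push the argument slightly below acosh's domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

Under this lift, the distance from the origin to a point equals the Euclidean norm of the original vector, which is one way to read the abstract's remark that specificity is encoded through norm-based (radial) separation.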

30. 【2602.07695】EventCast: Hybrid Demand Forecasting in E-Commerce with LLM-Based Event Knowledge

Link: https://arxiv.org/abs/2602.07695

Authors: Congcong Hu, Yuang Shi, Fan Huang, Yang Xiang, Zhou Ye, Ming Jin, Shiyu Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: impacting inventory planning, directly impacting inventory, fulfillment scheduling, impacting inventory, inventory planning

Abstract: Demand forecasting is a cornerstone of e-commerce operations, directly impacting inventory planning and fulfillment scheduling. However, existing forecasting systems often fail during high-impact periods such as flash sales, holiday campaigns, and sudden policy interventions, where demand patterns shift abruptly and unpredictably. In this paper, we introduce EventCast, a modular forecasting framework that integrates future event knowledge into time-series prediction. Unlike prior approaches that ignore future interventions or directly use large language models (LLMs) for numerical forecasting, EventCast leverages LLMs solely for event-driven reasoning. Unstructured business data from existing operational databases, covering campaigns, holiday schedules, and seller incentives, is processed by an LLM that converts it into interpretable textual summaries, leveraging world knowledge for cultural nuances and novel event combinations. These summaries are fused with historical demand features within a dual-tower architecture, enabling accurate, explainable, and scalable forecasts. Deployed on real-world e-commerce scenarios spanning 160 regions across 4 countries over 10 months, EventCast achieves up to 86.9% and 97.7% improvement on MAE and MSE compared to the variant without event knowledge, and reduces MAE by up to 57.0% and MSE by 83.3% versus the best industrial baseline during event-driven periods. EventCast has been deployed in real-world industrial pipelines since March 2025, offering a practical solution for improving operational decision-making in dynamic e-commerce environments.

31. 【2602.07526】MSN: A Memory-based Sparse Activation Scaling Framework for Large-scale Industrial Recommendation

Link: https://arxiv.org/abs/2602.07526

Authors: Shikang Wu, Hui Lu, Jinqiu Jin, Zheng Chai, Shiyong Hong, Junjie Zhang, Shanlei Mu, Kaiyuan Ma, Tianyi Liu, Yuchao Zheng, Zhe Wang, Jingjian Lin

Categories: Information Retrieval (cs.IR)

Keywords: Scaling deep learning, deep learning recommendation, sparse activation scaling, deep learning, memory

Abstract: Scaling deep learning recommendation models is an effective way to improve model expressiveness. Existing approaches often incur substantial computational overhead, making them difficult to deploy in large-scale industrial systems under strict latency constraints. Recent sparse activation scaling methods, such as Sparse Mixture-of-Experts, reduce computation by activating only a subset of parameters, but still suffer from high memory access costs and limited personalization capacity due to the large size and small number of experts. To address these challenges, we propose MSN, a memory-based sparse activation scaling framework for recommendation models. MSN dynamically retrieves personalized representations from a large parameterized memory and integrates them into downstream feature interaction modules via a memory gating mechanism, enabling fine-grained personalization with low computational overhead. To enable further expansion of the memory capacity while keeping both computational and memory access costs under control, MSN adopts a Product-Key Memory (PKM) mechanism, which reduces memory retrieval complexity from linear to sub-linear. In addition, normalization and over-parameterization techniques are introduced to maintain balanced memory utilization and prevent memory retrieval collapse. We further design a customized Sparse-Gather operator and adopt the AirTopK operator to improve training and inference efficiency in industrial settings. Extensive experiments demonstrate that MSN consistently improves recommendation performance while maintaining high efficiency. Moreover, MSN has been successfully deployed in the Douyin Search Ranking System, achieving significant gains over deployed state-of-the-art models in both offline evaluation metrics and large-scale online A/B tests.
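
The sub-linear lookup attributed to Product-Key Memory can be sketched in a few lines. This follows the general PKM idea (split the query in two, score each half against its own small sub-key table, then combine only the top candidates) and is not MSN's implementation: the gating, normalization, and custom operators from the abstract are omitted, and all names are hypothetical.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pkm_lookup(query, sub_keys_1, sub_keys_2, values, k):
    """Retrieve the top-k memory slots and their softmax-weighted value sum.

    The memory has len(sub_keys_1) * len(sub_keys_2) slots; slot (i, j) has
    score q1 . sub_keys_1[i] + q2 . sub_keys_2[j] and index i * len(sub_keys_2) + j.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    s1 = [dot(q1, key) for key in sub_keys_1]
    s2 = [dot(q2, key) for key in sub_keys_2]
    # Top-k per half: only these k * k pairs can contain the global top-k.
    top1 = heapq.nlargest(k, range(len(s1)), key=s1.__getitem__)
    top2 = heapq.nlargest(k, range(len(s2)), key=s2.__getitem__)
    candidates = [(s1[i] + s2[j], i * len(sub_keys_2) + j) for i in top1 for j in top2]
    best = heapq.nlargest(k, candidates)

    # Softmax over the selected scores, then a weighted sum of the value vectors.
    peak = best[0][0]
    weights = [math.exp(score - peak) for score, _ in best]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for weight, (_, slot) in zip(weights, best):
        for d, component in enumerate(values[slot]):
            out[d] += (weight / total) * component
    return out, [slot for _, slot in best]
```

With n sub-keys per half, the memory addresses n² slots but the lookup scores only 2n sub-keys plus k² candidate pairs, which is where the sub-linear cost in the number of slots comes from.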

32. 【2602.07525】IGMiRAG: Intuition-Guided Retrieval-Augmented Generation with Adaptive Mining of In-Depth Memory

Link: https://arxiv.org/abs/2602.07525

Authors: Xingliang Hou, Yuyan Liu, Qi Sun, haoxiu wang, Hao Hu, Shaoyi Du, Zhiqiang Tian

Categories: Information Retrieval (cs.IR)

Keywords: equips large language, large language models, Retrieval-augmented generation, equips large, language models

Comments: 29 pages, Information Retrieval

Abstract:Retrieval-augmented generation (RAG) equips large language models (LLMs) with reliable knowledge memory. To strengthen cross-text associations, recent research integrates graphs and hypergraphs into RAG to capture pairwise and multi-entity relations as structured links. However, their misaligned memory organization necessitates costly, disjointed retrieval. To address these limitations, we propose IGMiRAG, a framework inspired by human intuition-guided reasoning. It constructs a hierarchical heterogeneous hypergraph to align multi-granular knowledge, incorporating deductive pathways to simulate realistic memory structures. During querying, IGMiRAG distills intuitive strategies via a question parser to control mining depth and memory window, and activates instantaneous memories as anchors using dual-focus retrieval. Mirroring human intuition, the framework guides retrieval resource allocation dynamically. Furthermore, we design a bidirectional diffusion algorithm that navigates deductive paths to mine in-depth memories, emulating human reasoning processes. Extensive evaluations indicate IGMiRAG outperforms the state-of-the-art baseline by 4.8% EM and 5.0% F1 overall, with token costs adapting to task complexity (average 6.3k+, minimum 3.0k+). This work presents a cost-effective RAG paradigm that improves both efficiency and effectiveness.

33. 【2602.07520】MDL: A Unified Multi-Distribution Learner in Large-scale Industrial Recommendation through Tokenization

Link: https://arxiv.org/abs/2602.07520

Authors: Shanlei Mu, Yuchen Jiang, Shikang Wu, Shiyong Hong, Tianmu Sha, Junjie Zhang, Jie Zhu, Zhe Chen, Zhe Wang, Jingjian Lin

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: adopt multi-scenario learning, recommender systems increasingly, systems increasingly adopt, increasingly adopt multi-scenario, existing approaches suffer

Comments: 9 pages, 4 figures

Abstract: Industrial recommender systems increasingly adopt multi-scenario learning (MSL) and multi-task learning (MTL) to handle diverse user interactions and contexts, but existing approaches suffer from two critical drawbacks: (1) underutilization of large-scale model parameters due to limited interaction with complex feature modules, and (2) difficulty in jointly modeling scenario and task information in a unified framework. To address these challenges, we propose a unified Multi-Distribution Learning (MDL) framework, inspired by the "prompting" paradigm in large language models (LLMs). MDL treats scenario and task information as specialized tokens rather than auxiliary inputs or gating signals. Specifically, we introduce a unified information tokenization module that transforms features, scenarios, and tasks into a unified tokenized format. To facilitate deep interaction, we design three synergistic mechanisms: (1) feature token self-attention for rich feature interactions, (2) domain-feature attention for scenario/task-adaptive feature activation, and (3) domain-fused aggregation for joint distribution prediction. By stacking these interactions, MDL enables scenario and task information to "prompt" and activate the model's vast parameter space in a bottom-up, layer-wise manner. Extensive experiments on real-world industrial datasets demonstrate that MDL significantly outperforms state-of-the-art MSL and MTL baselines. Online A/B testing on Douyin Search platform over one month yields +0.0626% improvement in LT30 and -0.3267% reduction in change query rate. MDL has been fully deployed in production, serving hundreds of millions of users daily.

34. 【2602.07442】Echoes in the Loop: Diagnosing Risks in LLM-Powered Recommender Systems under Feedback Loops

Link: https://arxiv.org/abs/2602.07442

Authors: Donguk Park, Dongwon Lee, Yeon-Chang Lee

Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Large language models, multiple functional roles, Large language, language models, data augmentation

Abstract:Large language models (LLMs) are increasingly embedded into recommender systems, where they operate across multiple functional roles such as data augmentation, profiling, and decision making. While prior work emphasizes recommendation performance, the systemic risks of LLMs, such as bias and hallucination, and their propagation through feedback loops remain largely unexplored. In this paper, we propose a role-aware, phase-wise diagnostic framework that traces how these risks emerge, manifest in ranking outcomes, and accumulate over repeated recommendation cycles. We formalize a controlled feedback-loop pipeline that simulates long-term interaction dynamics and enables empirical measurement of risks at the LLM-generated content, ranking, and ecosystem levels. Experiments on widely used benchmarks demonstrate that LLM-based components can amplify popularity bias, introduce spurious signals through hallucination, and lead to polarized and self-reinforcing exposure patterns over time. We plan to release our framework as an open-source toolkit to facilitate systematic risk analysis across diverse LLM-powered recommender systems.

35. 【2602.07361】ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations

Link: https://arxiv.org/abs/2602.07361

Authors: Long S. T. Nguyen, Quan M. Bui, Tin T. Ngo, Quynh T. N. Vo, Dung N. H. Le, Tho T. Quan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Question Answering, legally interdependent texts, Vietnamese healthcare regulatory, Vietnamese Healthcare, inherently challenging due

Comments: Accepted at ACIIDS 2026

Abstract:Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at this https URL.

36. 【2602.07333】High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning

Link: https://arxiv.org/abs/2602.07333

Authors: Rajat Arora, Ye Tao, Jianqiang Shen, Ping Liu, Muchen Wu, Qianqi Shen, Benjamin Le, Fedor Borisyuk, Jingwei Wu, Wenjing Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: search activity logs, Large Language Models, Effective personalization, including profiles, professional data

Abstract:Effective personalization on large-scale job platforms requires modeling members based on heterogeneous textual sources, including profiles, professional data, and search activity logs. As recommender systems increasingly adopt Large Language Models (LLMs), creating unified, interpretable, and concise representations from heterogeneous sources becomes critical, especially for latency-sensitive online environments. In this work, we propose a novel Reinforcement Learning (RL) framework to synthesize a unified textual representation for each member. Our approach leverages implicit user engagement signals (e.g., clicks, applies) as the primary reward to distill salient information. Additionally, the framework is complemented by rule-based rewards that enforce formatting and length constraints. Extensive offline experiments across multiple LinkedIn products, one of the world's largest job platforms, demonstrate significant improvements in key downstream business metrics. This work provides a practical, labeling-free, and scalable solution for constructing interpretable user representations that are directly compatible with LLM-based systems.
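The reward design described above, an engagement signal as the primary reward plus rule-based terms for format and length, could be sketched roughly as follows. This is only an illustration: the function names, the 120-word budget, the single-paragraph rule, and the 0.2 weight are all hypothetical, since the abstract does not give the actual rules or weights.

```python
def rule_based_reward(text, max_words=120):
    """Hypothetical rule reward: 1.0 when the summary is non-empty, within the
    word budget, and a single paragraph; 0.0 otherwise."""
    words = text.split()
    if not words or len(words) > max_words:
        return 0.0
    if "\n\n" in text.strip():
        return 0.0
    return 1.0


def total_reward(text, engagement, rule_weight=0.2):
    """Blend an engagement signal in [0, 1] (assumed to be derived from clicks
    or applies) with the rule-based reward using a fixed, assumed weight."""
    if not 0.0 <= engagement <= 1.0:
        raise ValueError("engagement must be in [0, 1]")
    return (1.0 - rule_weight) * engagement + rule_weight * rule_based_reward(text)
```

With these assumed settings, a well-formed summary that led to engagement scores close to 1.0, while an empty summary only ever receives the scaled engagement term.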

37. 【2602.07309】Semantic Search At LinkedIn

Link: https://arxiv.org/abs/2602.07309

Authors: Fedor Borisyuk, Sriram Vasudevan, Muchen Wu, Guoyao Li, Benjamin Le, Shaobo Zhang, Qianqi Kay Shen, Yuchin Juan, Kayhan Behdin, Liming Dong, Kaixu Yang, Shusen Jing, Ravi Pothamsetty, Rajat Arora, Sophie Yanying Sheng, Vitaly Abdrashitov, Yang Zhao, Lin Su, Xiaoqing Wang, Chujie Zheng, Sarang Metkar, Rupesh Gupta, Igor Lapchuk, David N. Racca, Madhumitha Mohan, Yanbo Li, Haojun Li, Saloni Gandhi, Xueying Lu, Chetan Bhole, Ali Hooshmand, Xin Yang, Raghavan Muthuregunathan, Jiajun Zhang, Mathew Teoh, Adam Coler, Abhinav Gupta, Xiaojing Ma, Sundara Raman Ramachandran, Morteza Ramezani, Yubo Wang, Lijuan Zhang, Richard Li, Jian Sheng, Chanh Nguyen, Yen-Chi Chen, Chuanrui Zhu, Claire Zhang, Jiahao Xu, Deepti Kulkarni, Qing Lan, Arvind Subramaniam, Ata Fatahibaarzi, Steven Shimizu, Yanning Chen, Zhipeng Wang, Ran He, Zhengze Zhou, Qingquan Song, Yun Dai, Caleb Johnson, Ping Liu, Shaghayegh Gharghabi, Gokulraj Mohanasundaram, Juan Bottaro, Santhosh Sachindran, Qi Guo, Yunxiang Ren, Chengming Jiang, Di Mo, Luke Simon, Jianqiang Shen, Jingwei Wu, Wenjing Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, Small Language Model, requires major inference, LLM-based semantic search, semantic search framework

Abstract:Semantic search with large language models (LLMs) enables retrieval by meaning rather than keyword overlap, but scaling it requires major inference efficiency advances. We present LinkedIn's LLM-based semantic search framework for AI Job Search and AI People Search, combining an LLM relevance judge, embedding-based retrieval, and a compact Small Language Model trained via multi-teacher distillation to jointly optimize relevance and engagement. A prefill-oriented inference architecture co-designed with model pruning, context compression, and text-embedding hybrid interactions boosts ranking throughput by over 75x under a fixed latency constraint while preserving near-teacher-level NDCG, enabling one of the first production LLM-based ranking systems with efficiency comparable to traditional approaches and delivering significant gains in quality and user engagement.

38. 【2602.07307】LIT-GRAPH: Evaluating Deep vs. Shallow Graph Embeddings for High-Quality Text Recommendation in Domain-Specific Knowledge Graphs

Link: https://arxiv.org/abs/2602.07307

Authors: Nirmal Gelal, Chloe Snow, Kathleen M. Jagodnik, Ambyr Rios, Hande Küçük McGinty

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: study presents LIT-GRAPH, pedagogically aligned instructional, high school English, school English teachers, scaffold high school

Abstract:This study presents LIT-GRAPH (Literature Graph for Recommendation and Pedagogical Heuristics), a novel knowledge graph-based recommendation system designed to scaffold high school English teachers in selecting diverse, pedagogically aligned instructional literature. The system is built upon an ontology for English literature, addressing the challenge of curriculum stagnation, where we compare four graph embedding paradigms: DeepWalk, Biased Random Walk (BRW), Hybrid (concatenated DeepWalk and BRW vectors), and the deep model Relational Graph Convolutional Network (R-GCN). Results reveal a critical divergence: while shallow models excelled in structural link prediction, R-GCN dominated semantic ranking. By leveraging relation-specific message passing, the deep model prioritizes pedagogical relevance over raw connectivity, resulting in superior, high-quality, domain-specific recommendations.

39. 【2602.07298】Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

Link: https://arxiv.org/abs/2602.07298

Authors: Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Xiangjun Fan, Hong Yan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, optimizing resource allocation, Language Models, represent a promising

Abstract:Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform ($+130\%$ on recall@100 for SasRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliable scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.
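For readers unfamiliar with how a power-law scaling trend is usually checked, the sketch below fits y = a * x^(-b) by ordinary least squares in log-log space. It is a generic illustration, not the paper's procedure: the function name is hypothetical and the data points are synthetic values made up for the example.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic example: "loss" values generated from an exact power law.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
```

On real measurements the points will not lie exactly on a line in log-log space, so the quality of the fit (for example the residuals) matters as much as the fitted exponent.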

40. 【2602.07297】Progressive Searching for Retrieval in RAG

Link: https://arxiv.org/abs/2602.07297

Authors: Taehee Jeong, Xingzhe Zhao, Peizu Li, Markus Valvur, Weihua Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Retrieval Augmented Generation, Augmented Generation, large language models, language models, Retrieval Augmented

Abstract:Retrieval Augmented Generation (RAG) is a promising technique for mitigating two key limitations of large language models (LLMs): outdated information and hallucinations. A RAG system stores documents as embedding vectors in a database. Given a query, a search is executed to find the most related documents. Then, the topmost matching documents are inserted into the LLM's prompt to generate a response. Efficient and accurate searching is critical for RAG to get relevant information. We propose a cost-effective searching algorithm for the retrieval process. Our progressive searching algorithm incrementally refines the candidate set through a hierarchy of searches, starting from low-dimensional embeddings and progressing to the higher target dimensionality. This multi-stage approach reduces retrieval time while preserving the desired accuracy. Our findings demonstrate that progressive search in RAG systems achieves a balance between dimensionality, speed, and accuracy, enabling scalable and high-performance retrieval even for large databases.
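The coarse-to-fine idea can be sketched in a few lines. The version below assumes that a low-dimensional embedding is simply the leading slice of the full vector, which the abstract does not state; the function names, the stage schedule, and the toy vectors are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def progressive_search(query, docs, stages):
    """Shrink the candidate set stage by stage.

    stages is a list of (dim, keep) pairs: score the surviving candidates on
    the first `dim` dimensions, then keep the best `keep` of them.
    """
    candidates = list(range(len(docs)))
    for dim, keep in stages:
        q = query[:dim]
        candidates.sort(key=lambda i: cosine(q, docs[i][:dim]), reverse=True)
        candidates = candidates[:keep]
    return candidates


docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.1],
]
query = [1.0, 0.0, 0.0, 0.1]
top = progressive_search(query, docs, [(2, 3), (4, 2)])  # [0, 3]
```

The first stage scores every document on two dimensions only and discards document 2. The second stage then re-ranks the three survivors at full dimensionality, so the expensive comparison runs on a smaller set.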

41. 【2602.07208】Sequences as Nodes for Contrastive Multimodal Graph Recommendation

Link: https://arxiv.org/abs/2602.07208

Authors: Bucher Sahyouni, Matthew Vowels, Liqun Chen, Simon Hadfield

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: recommender systems, Sequence-Item Contrastive Recommender, contrastive techniques, data sparsity issues, numerous multimodal

Abstract:To tackle cold-start and data sparsity issues in recommender systems, numerous multimodal, sequential, and contrastive techniques have been proposed. While these augmentations can boost recommendation performance, they tend to add noise and disrupt useful semantics. To address this, we propose MuSICRec (Multimodal Sequence-Item Contrastive Recommender), a multi-view graph-based recommender that combines collaborative, sequential, and multimodal signals. We build a sequence-item (SI) view by attention pooling over the user's interacted items to form sequence nodes. We propagate over the SI graph, obtaining a second view organically as an alternative to artificial data augmentation, while simultaneously injecting sequential context signals. Additionally, to mitigate modality noise and align the multimodal information, the contribution of text and visual features is modulated according to an ID-guided gate. We evaluate under a strict leave-two-out split against a broad range of sequential, multimodal, and contrastive baselines. On the Amazon Baby, Sports, and Electronics datasets, MuSICRec outperforms state-of-the-art baselines across all model types. We observe the largest gains for short-history users, mitigating sparsity and cold-start challenges. Our code is available at this https URL and will be made publicly available.

42. 【2602.07207】Multimodal Enhancement of Sequential Recommendation

Link: https://arxiv.org/abs/2602.07207

Authors: Bucher Sahyouni, Matthew Vowels, Liqun Chen, Simon Hadfield

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Sequential Transformer-based Recommendation, Sequential Transformer-based, Transformer-based Recommendation, recommender framework, Multimodal and Sequential

Abstract:We propose a novel recommender framework, MuSTRec (Multimodal and Sequential Transformer-based Recommendation), that unifies multimodal and sequential recommendation paradigms. MuSTRec captures cross-item similarities and collaborative filtering signals by building item-item graphs from extracted text and visual features. A frequency-based self-attention module additionally captures the short- and long-term user preferences. Across multiple Amazon datasets, MuSTRec demonstrates superior performance (up to 33.5% improvement) over multimodal and sequential state-of-the-art baselines. Finally, we detail some interesting facets of this new recommendation paradigm. These include the need for a new data partitioning regime, and a demonstration of how integrating user embeddings into sequential recommendation leads to drastically increased short-term metrics (up to 200% improvement) on smaller datasets. Our code is available at this https URL and will be made publicly available.

43. 【2602.07125】Reasoning-Augmented Representations for Multimodal Retrieval

Link: https://arxiv.org/abs/2602.07125

Authors: Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Universal Multimodal Retrieval, models remain brittle, require latent reasoning, Universal Multimodal, queries require latent

Abstract:Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision-Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at this https URL.

44. 【2602.07664】Assessing the impact of Open Research Information Infrastructures using NLP driven full-text Scientometrics: A case study of the LXCat open-access platform

Link: https://arxiv.org/abs/2602.07664

Authors: Kalp Pandya, Khushi Shah, Nirmal Shah, Nakshi Shah, Bhaskar Chaudhury

Categories: Plasma Physics (physics.plasm-ph); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: Open research information, Open research, knowledge is produced, ORI, ORI infrastructures

Abstract:Open research information (ORI) infrastructures play a central role in shaping how scientific knowledge is produced, disseminated, validated, and reused across the research lifecycle. While the visibility of such ORI infrastructures is often assessed through citation-based metrics, in this study, we present a full-text, natural language processing (NLP) driven scientometric framework to systematically quantify the impact of ORI infrastructures beyond citation counts, using the LXCat platform for low-temperature plasma (LTP) research as a representative case study. The modeling of LTPs and interpretation of LTP experiments rely heavily on accurate data, much of which is hosted on LXCat, a community-driven, open-access platform central to the LTP research ecosystem. To investigate the scholarly impact of the LXCat platform over the past decade, we analyzed a curated corpus of full-text research articles citing three foundational LXCat publications. We present a comprehensive pipeline that integrates chemical entity recognition, dataset and solver mention extraction, affiliation-based geographic mapping, and topic modeling to extract fine-grained patterns of data usage that reflect implicit research priorities, data practices, differential reliance on specific databases, evolving modes of data reuse and coupling within scientific workflows, and thematic evolution. Importantly, our proposed methodology is domain-agnostic and transferable to other ORI contexts; it highlights the utility of NLP in quantifying the role of scientific data infrastructures and offers a data-driven reflection on how open-access platforms like LXCat contribute to shaping research directions. This work presents a scalable scientometric framework that has the potential to support evidence-based evaluation of ORI platforms and to inform infrastructure design, governance, sustainability, and policy for future development.

Computer Vision

1. 【2602.09024】Autoregressive Image Generation with Masked Bit Modeling

Link: https://arxiv.org/abs/2602.09024

Authors: Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, Xi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper challenges, challenges the dominance, pipelines in visual, discrete, continuous

Comments: SOTA discrete visual generation defeats diffusion models with 0.99 FID score, project page is available at [this https URL](https://bar-gen.github.io/)

Abstract:This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with scaled codebook. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens through progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. Project page is available at this https URL
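To make "predicting a token through its constituent bits" concrete: a codebook with 2^n entries can be indexed by n bits, so a head that outputs bits needs n binary predictions instead of one 2^n-way choice. The helpers below only show that index-to-bits mapping. The names and the most-significant-bit-first ordering are assumptions, and BAR's actual masking schedule is not reproduced here.

```python
def token_to_bits(token, num_bits):
    """Return the num_bits-long binary code of a token index, MSB first."""
    if not 0 <= token < 2 ** num_bits:
        raise ValueError("token index does not fit in num_bits bits")
    return [(token >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_token(bits):
    """Inverse of token_to_bits: rebuild the token index from its bits."""
    token = 0
    for bit in bits:
        token = (token << 1) | bit
    return token


bits = token_to_bits(5, 4)     # [0, 1, 0, 1]
token = bits_to_token(bits)    # 5
```

Under this encoding, doubling the codebook size adds a single bit to predict, which is one way to see why a bit-level head can accommodate very large codebooks.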

2. 【2602.09022】WorldCompass: Reinforcement Learning for Long-Horizon World Models

Link: https://arxiv.org/abs/2602.09022

Authors: Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reinforcement Learning, interactive video-based world, work presents WorldCompass, video-based world models, post-training framework

Comments: Project page: [this https URL](https://3d-models.hunyuan.tencent.com/world/)

Abstract:This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for the long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-level rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.

3. 【2602.09021】$χ_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies

Link: https://arxiv.org/abs/2602.09021

Authors: Checheng Yu, Chonghao Sima, Gangcheng Jiang, Hai Zhang, Haoguang Mai, Hongyang Li, Huijie Wang, Jin Chen, Kaiyang Wu, Li Chen, Lirui Zhao, Modi Shi, Ping Luo, Qingwen Bu, Shijia Peng, Tianyu Li, Yibo Yuan

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex real-world dynamics, understand complex real-world, traditionally relied, relied on large-scale, compute to understand

Abstract:High-reliability long-horizon robotic manipulation has traditionally relied on large-scale data and compute to understand complex real-world dynamics. However, we identify that the primary bottleneck to real-world robustness is not resource scale alone, but the distributional shift among the human demonstration distribution, the inductive bias learned by the policy, and the test-time execution distribution, a systematic inconsistency that causes compounding errors in multi-stage tasks. To mitigate these inconsistencies, we propose $\chi_{0}$, a resource-efficient framework with effective modules designed to achieve production-level robustness in robotic manipulation. Our approach builds on three technical pillars: (i) Model Arithmetic, a weight-space merging strategy that efficiently soaks up diverse distributions of different demonstrations, varying from object appearance to state variations; (ii) Stage Advantage, a stage-aware advantage estimator that provides stable, dense progress signals, overcoming the numerical instability of prior non-stage approaches; and (iii) Train-Deploy Alignment, which bridges the distribution gap via spatio-temporal augmentation, heuristic DAgger corrections, and temporal chunk-wise smoothing. $\chi_{0}$ enables two sets of dual-arm robots to collaboratively orchestrate long-horizon garment manipulation, spanning tasks from flattening and folding to hanging different clothes. Our method exhibits high-reliability autonomy; we are able to run the system from an arbitrary initial state for 24 consecutive hours. Experiments validate that $\chi_{0}$ surpasses the state-of-the-art $\pi_{0.5}$ in success rate by nearly 250%, with only 20 hours of data and 8 A100 GPUs. Code, data and models will be released to facilitate the community.
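As a rough picture of what weight-space merging means, the sketch below takes a weighted average of several checkpoints parameter by parameter. Plain weighted averaging is a stand-in chosen for illustration; the abstract does not describe the actual Model Arithmetic rule, and the function name and the dictionary-of-lists checkpoint format are hypothetical.

```python
def merge_weights(checkpoints, coefficients):
    """Weighted average of checkpoints stored as {name: list of floats}.

    All checkpoints are assumed to share the same parameter names and shapes.
    """
    if not checkpoints or len(checkpoints) != len(coefficients):
        raise ValueError("need one coefficient per checkpoint")
    total = sum(coefficients)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coefficients, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


policy_a = {"layer.weight": [1.0, 2.0]}   # e.g. trained on one demonstration set
policy_b = {"layer.weight": [3.0, 6.0]}   # e.g. trained on another
merged = merge_weights([policy_a, policy_b], [1.0, 1.0])
```

Merging in weight space produces a single policy with the inference cost of one model, unlike an ensemble, which has to run every member.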

4. 【2602.09018】Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Link: https://arxiv.org/abs/2602.09018

Authors: Amir Mallak, Alaa Maalouf

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: hiding what breaks, breaks a policy, single number, policies, sim

Abstract:Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
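The factorized protocol can be pictured as enumerating every environment that differs from the in-distribution baseline on exactly k of the five axes. The sketch below does that with two values per axis. Only rural/urban and day/night come from the abstract; the season, weather, and agent values, along with all names, are placeholders.

```python
from itertools import combinations, product

# Two values per axis keeps the example small; real axes may have more.
AXES = {
    "scene": ["rural", "urban"],
    "season": ["summer", "winter"],   # placeholder values
    "weather": ["clear", "rain"],     # placeholder values
    "time": ["day", "night"],
    "agents": ["cars", "mixed"],      # placeholder values
}


def k_factor_variants(baseline, k):
    """List every environment differing from baseline on exactly k axes."""
    variants = []
    for changed in combinations(AXES, k):
        alternatives = [
            [value for value in AXES[axis] if value != baseline[axis]]
            for axis in changed
        ]
        for choice in product(*alternatives):
            env = dict(baseline)
            env.update(zip(changed, choice))
            variants.append(env)
    return variants


baseline = {"scene": "rural", "season": "summer", "weather": "clear",
            "time": "day", "agents": "cars"}
counts = [len(k_factor_variants(baseline, k)) for k in range(4)]  # [1, 5, 10, 10]
```

Reporting a success rate per variant, rather than one average, is what allows interaction effects such as the season-time combinations mentioned above to show up.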

5. 【2602.09016】Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Link: https://arxiv.org/abs/2602.09016

Authors: Hao Phung, Hadar Averbuch-Elor

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: structured vector-graphics representation, CAD workflows, Reconstructing a structured, understanding or CAD, computational tasks involving

Comments: Code: [this https URL](https://anonymous.4open.science/r/Raster2Seq-BE73/)

Abstract:Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners using guidance from learnable anchors. These anchors represent spatial coordinates in image space, allowing the attention mechanism to be directed effectively toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling efficient handling of complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.

6. 【2602.09014】ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Link: https://arxiv.org/abs/2602.09014

Authors: Zihan Yang (1), Shuyuan Tu (1), Licheng Zhang (1), Qi Dai (2), Yu-Gang Jiang (1), Zuxuan Wu (1) ((1) Fudan University, (2) Microsoft Research Asia)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: motivating recent efforts, achieved remarkable generation, inference cost due, multiple sequential denoising, remarkable generation quality

Abstract:Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory by using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow only fine-tunes on less than 5% of original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks show the effectiveness of ArcFlow both qualitatively and quantitatively.

7. 【2602.09013】Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction

Link: https://arxiv.org/abs/2602.09013

Authors: Hongyi Chen,Tony Dong,Tiancheng Wu,Liquan Wang,Yash Jangir,Yaru Niu,Yufei Ye,Homanga Bharadhwaj,Zackory Erickson,Jeffrey Ichnowski

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-dimensional action space, Multi-finger robotic hand, acquiring large-scale training, Multi-finger robotic, challenging due

Abstract:Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at this http URL.

8. 【2602.09007】GEBench: Benchmarking Image Generation Models as GUI Environments

Link: https://arxiv.org/abs/2602.09007

Authors: Haodong Li,Jingwei Wu,Quan Sun,Guopeng Li,Juanxi Tian,Huanyu Zhang,Yanlin Lai,Ruichuan An,Hongbo Peng,Yuhong Dai,Chenxi Li,Chunmei Qing,Jia Wang,Ziyang Meng,Zheng Ge,Xiangyu Zhang,Daxin Jiang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, future Graphical User, User Interface, Graphical User, Recent advancements

Comments: 23 pages, 5 figures, 4 tables

Abstract:Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: this https URL.

9. 【2602.08996】Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study

Link: https://arxiv.org/abs/2602.08996

Authors: Arushi Rai,Adriana Kovashka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced reasoning capabilities, prior work shows, sports feedback, reasoning capabilities, prior work

Comments: To appear at WACV 2026

Abstract:While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary freely-available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint, source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.

10. 【2602.08971】WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

Link: https://arxiv.org/abs/2602.08971

Authors: Yu Shang,Zhuohang Li,Yiding Ma,Weikang Su,Xin Jin,Ziyou Wang,Xin Zhang,Yinzhou Tang,Chen Gao,Wei Wu,Xihui Liu,Dhruv Shah,Zhaoxiang Zhang,Zhibo Chen,Jun Zhu,Yonghong Tian,Tat-Seng Chua,Wenwu Zhu,Yong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: evaluation remains fragmented, world models, action-conditioned prediction, remains fragmented, embodied world models

Abstract:While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark, together with a public leaderboard, is released at this https URL, providing a framework for tracking progress toward truly functional world models in embodied AI.

11. 【2602.08962】Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting

Link: https://arxiv.org/abs/2602.08962

Authors: Guangxun Zhu,Xuan Liu,Nicolas Pugeault,Chongfeng Wei,Edmond S. L. Ho

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Accurately predicting pedestrian, complex urban environments, Accurately predicting, urban environments, crucial for safe

Comments: Accepted for IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract:Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: this https URL

12. 【2602.08961】MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Link: https://arxiv.org/abs/2602.08961

Authors: Ruijie Zhu,Jiahao Lu,Wenbo Hu,Xiaoguang Han,Jianfei Cai,Ying Shan,Chuanxia Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG)

Keywords: video diffusion-based framework, video diffusion-based, monocular video, estimates dense motion, jointly reconstructs

Comments: Project page: [this https URL](https://ruijiezhu94.github.io/MotionCrafter_Page)

Abstract:We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: this https URL

13. 【2602.08958】Grow with the Flow: 4D Reconstruction of Growing Plants with Gaussian Flow Fields

Link: https://arxiv.org/abs/2602.08958

Authors: Weihan Luo,Lily Goli,Sherwin Bahmani,Felix Taubner,Andrea Tagliasacchi,David B. Lindell

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: poses unique challenges, growth poses unique, unique challenges, poses unique, Gaussian

Comments: Project page: [this https URL](https://weihanluo.ca/growflow/)

Abstract:Modeling the time-varying 3D appearance of plants during their growth poses unique challenges: unlike many dynamic scenes, plants generate new geometry over time as they expand, branch, and differentiate. Recent motion modeling techniques are ill-suited to this problem setting. For example, deformation fields cannot introduce new geometry, and 4D Gaussian splatting constrains motion to a linear trajectory in space and time and cannot track the same set of Gaussians over time. Here, we introduce a 3D Gaussian flow field representation that models plant growth as a time-varying derivative over Gaussian parameters -- position, scale, orientation, color, and opacity -- enabling nonlinear and continuous-time growth dynamics. To initialize a sufficient set of Gaussian primitives, we reconstruct the mature plant and learn a process of reverse growth, effectively simulating the plant's developmental history in reverse. Our approach achieves superior image quality and geometric accuracy compared to prior methods on multi-view timelapse datasets of plant growth, providing a new approach for appearance modeling of growing 3D structures.

14. 【2602.08909】Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit

Link: https://arxiv.org/abs/2602.08909

Authors: Zhendong Wang,Cihan Ruan,Jingchuan Xiao,Chuqing Shi,Wei Jiang,Wei Wang,Wenjie Liu,Nam Ling

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Gaussian Splatting, standard multi-view optimization, solutions from standard, investigate what structure, structure emerges

Abstract:We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.

15. 【2602.08882】Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals

Link: https://arxiv.org/abs/2602.08882

Authors: Puqi Zhou(1),Ali Asgarov(2),Aafiya Hussain(2),Wonjoon Park(3),Amit Paudyal(1),Sameep Shrestha(1),Chia-wei Tang(2),Michael F. Lighthiser(1),Michael R. Hieb(1),Xuesu Xiao(1),Chris Thomas(2),Sungsoo Ray Hong(1) ((1) George Mason University, Fairfax, VA, USA (2) Virginia Tech, Blacksburg, VA, USA (3) University of Maryland, College Park, MD, USA)

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: reducing professionals' burden, providing scalable situational, scalable situational awareness, advance public safety, public safety

Abstract:Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools. The testbed is available at this https URL_VideoSensemaking

16. 【2602.08861】TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models

Link: https://arxiv.org/abs/2602.08861

Authors: Xiangtian Zheng,Zishuo Wang,Yuxin Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multi-Modal Large Language, Language Models, Large Language, Video Multi-Modal Large

Abstract:With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
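
The two stages described in the abstract above, text-guided frame sampling followed by frame matching and merging, can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are assumed to come from some CLIP-style encoder and are plain lists of floats here, and the function names, the nearest-key-frame assignment in time, and the mean-based merge are hypothetical choices made only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_key_frames(prompt_emb, frame_embs, k):
    """Return the indices (in temporal order) of the k frames most similar to the prompt."""
    scores = [cosine(prompt_emb, f) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def merge_into_key_frames(frame_embs, key_idx):
    """Assumed merge rule: average every frame into its temporally nearest key frame."""
    groups = {i: [] for i in key_idx}
    for t, emb in enumerate(frame_embs):
        nearest = min(key_idx, key=lambda i: abs(i - t))  # ties go to the earlier key frame
        groups[nearest].append(emb)
    merged = []
    for i in key_idx:
        members = groups[i]  # never empty: each key frame is nearest to itself
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


# Toy embeddings standing in for encoder outputs.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
prompt = [1.0, 0.0]
keys = select_key_frames(prompt, frames, k=2)
reduced = merge_into_key_frames(frames, keys)
```

The model then receives `len(keys)` merged frames instead of `len(frames)` raw ones, which is where the reduction in attention cost would come from.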

17. 【2602.08858】FlattenGPT: Depth Compression for Transformer with Layer Flattening

Link: https://arxiv.org/abs/2602.08858

Authors: Ruihan Xu,Qingpei Guo,Yao Zhu,Xiangyang Ji,Ming Yang,Shiliang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent works, prompting the research, prune less crucial, Recent, blocks

Comments: Submitted to ICML 2026

Abstract:Recent works have indicated redundancy across transformer blocks, prompting research on depth compression to prune less crucial blocks. However, current entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning can better preserve performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios for individual layers. To pursue better model compression and acceleration, this paper proposes FlattenGPT, a novel way to detect and reduce depth-wise redundancies. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective parameter redundancy detection and removal. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT enhances model efficiency with a decent trade-off in performance. It outperforms existing pruning methods in both zero-shot accuracies and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90-96% of zero-shot performance with a compression ratio of 20%. It also outperforms other pruning methods in accelerating LLM inference, making it promising for enhancing the efficiency of transformers.

18. 【2602.08828】VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

Link: https://arxiv.org/abs/2602.08828

Authors: Hao Tan,Jun Lan,Senyuan Shi,Zichang Tan,Zijian Yu,Huijia Zhu,Weiqiang Wang,Jun Wan,Zhen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: escalating security risks, generation poses escalating, poses escalating security, detection increasingly essential, making reliable detection

Comments: Project: [this https URL](https://github.com/EricTan7/VideoVeritas)

Abstract:The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to be biased towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.

19. 【2602.08822】Any-to-All MRI Synthesis: A Unified Foundation Model for Nasopharyngeal Carcinoma and Its Downstream Applications

Link: https://arxiv.org/abs/2602.08822

Authors: Yao Pu,Yiming Shi,Zhenxi Zhang,Peixin Yu,Yitao Zhuang,Xiang Wang,Hongzhao Chen,Jing Cai,Ge Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic resonance imaging, long scan times, Magnetic resonance, resonance imaging, nasopharyngeal carcinoma

Abstract:Magnetic resonance imaging (MRI) is essential for nasopharyngeal carcinoma (NPC) radiotherapy (RT), but practical constraints, such as patient discomfort, long scan times, and high costs, often lead to incomplete modalities in clinical practice, compromising RT planning accuracy. Traditional MRI synthesis methods are modality-specific, limited in anatomical adaptability, and lack clinical interpretability, failing to meet NPC's RT needs. Here, we developed a unified foundation model integrating contrastive visual representation learning and vision-language alignment (VLA) to enable any-to-all MRI synthesis. The model uses a contrastive encoder for modality-invariant representations and a CLIP-based text-informed decoder for semantically consistent synthesis, supporting any-to-all MRI synthesis via one unified foundation model. Trained on 40,825 images from 13 institutions, it achieves consistently high performance (average SSIM 0.90, PSNR 27) across 26 internal/external validation sites (15,748 images), with superior synthesis fidelity and robustness to noise and domain shifts. Meanwhile, its unified representation enhances downstream RT-relevant tasks (e.g., segmentation). This work advances digital medicine solutions for NPC care by leveraging foundation models to bridge technical synthesis and clinical utility.
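
To relate the reported PSNR to pixel-level error, the standard definition is PSNR = 10 · log10(MAX² / MSE). The sketch below is only a worked illustration: it assumes the reported value is in dB and that intensities are normalized to [0, 1], neither of which the abstract states, and the function names are hypothetical.

```python
import math


def psnr(reference, synthesized, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db, max_value=1.0):
    """Root-mean-square error implied by a given PSNR, under the same assumptions."""
    return max_value * 10 ** (-psnr_db / 20)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. 20 dB.
example = psnr([0.0, 0.0], [0.1, 0.1])
implied_rmse = rmse_for_psnr(27.0)
```

Under these assumptions, 27 dB corresponds to an RMSE of roughly 4.5% of the intensity range.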

20. 【2602.08820】Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing

Link: https://arxiv.org/abs/2602.08820

Authors: Hao Yang,Zhiyu Tan,Jia Gong,Luozheng Qin,Hesen Chen,Xiaomeng Yang,Yuqing Sun,Yuetan Lin,Mengping Yang,Hao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computationally efficient model, scalable and computationally, computationally efficient, pretrained multimodal large-language, connects pretrained multimodal

Comments: Technical Report, Project: [this https URL](https://howellyoung-s.github.io/Omni-Video2-project/)

Abstract:We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, etc. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.

21. 【2602.08797】Addressing data annotation scarcity in Brain Tumor Segmentation on 3D MRI scan Using a Semi-Supervised Teacher-Student Framework

Link: https://arxiv.org/abs/2602.08797

Authors: Jiaming Liu,Cheng Ding,Daoqiang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Accurate brain tumor, Accurate brain, scanners and sites, expensive annotations, heterogeneity across scanners

Comments: 10 pages, 7 figures. Submitted to IEEE Journal of Biomedical and Health Informatics (JBHI)

Abstract:Accurate brain tumor segmentation from MRI is limited by expensive annotations and data heterogeneity across scanners and sites. We propose a semi-supervised teacher-student framework that combines an uncertainty-aware pseudo-labeling teacher with a progressive, confidence-based curriculum for the student. The teacher produces probabilistic masks and per-pixel uncertainty; unlabeled scans are ranked by image-level confidence and introduced in stages, while a dual-loss objective trains the student to learn from high-confidence regions and unlearn low-confidence ones. Agreement-based refinement further improves pseudo-label quality. On BraTS 2021, validation DSC increased from 0.393 (10% data) to 0.872 (100%), with the largest gains in early stages, demonstrating data efficiency. The teacher reached a validation DSC of 0.922, and the student surpassed the teacher on tumor subregions (e.g., NCR/NET 0.797 and Edema 0.980); notably, the student recovered the Enhancing class (DSC 0.620) where the teacher failed. These results show that confidence-driven curricula and selective unlearning provide robust segmentation under limited supervision and noisy pseudo-labels.
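
The DSC values quoted above refer to the Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), computed between a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice(pred, target):
    """Dice similarity coefficient between two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0


# One voxel overlaps; the prediction has 2 positives, the reference has 1.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])  # 2 * 1 / (2 + 1)
```

Because the denominator is the sum of both mask sizes, a small structure such as the Enhancing class is penalized heavily for even a few missed voxels, which is one reason per-class scores can differ so much.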

22. 【2602.08794】MOVA: Towards Scalable and Synchronized Video-Audio Generation

Link: https://arxiv.org/abs/2602.08794

Authors: SII-OpenMOSS Team:Donghua Yu,Mingshu Chen,Qi Chen,Qi Luo,Qianyi Wu,Qinyuan Cheng,Ruixiao Li,Tianyi Liang,Wenbo Zhang,Wenming Tu,Xiangyu Peng,Yang Gao,Yanru Huo,Ying Zhu,Yinze Luo,Yiyang Zhang,Yuerong Song,Zhe Xu,Zhiyu Zhang,Chenchen Yang,Cheng Chang,Chushu Zhou,Hanfu Chen,Hongnan Ma,Jiaxi Li,Jingqi Tong,Junxi Liu,Ke Chen,Shimin Li,Songlin Wang,Wei Jiang,Zhaoye Fei,Zhiyuan Ning,Chunguo Li,Chenhui Li,Ziwei He,Zengfeng Huang,Xie Chen,Xipeng Qiu

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: overlooked audio components, largely overlooked audio, indispensable for real-world, largely overlooked, audio components

Comments: Technical report for MOVA (open-source video-audio generation model). 38 pages, 10 figures, 22 tables. Project page: [this https URL](https://mosi.cn/models/mova) Code: [this https URL](https://github.com/OpenMOSS/MOVA) Models: [this https URL](https://huggingface.co/collections/OpenMOSS-Team/mova). Qinyuan Cheng and Tianyi Liang are project leaders. Xie Chen and Xipeng Qiu are corresponding authors

Abstract:Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.

23. 【2602.08792】Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems

Link: https://arxiv.org/abs/2602.08792

Authors: Hao Dong,Eleni Chatzi,Olga Fink

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: reliable power delivery, electrified rail systems, essential for ensuring, ensuring uninterrupted, uninterrupted and reliable

Abstract:The pantograph-catenary interface is essential for ensuring uninterrupted and reliable power delivery in electrified rail systems. However, electrical arcing at this interface poses serious risks, including accelerated wear of contact components, degraded system performance, and potential service disruptions. Detecting arcing events at the pantograph-catenary interface is challenging due to their transient nature, noisy operating environment, data scarcity, and the difficulty of distinguishing arcs from other similar transient phenomena. To address these challenges, we propose a novel multimodal framework that combines high-resolution image data with force measurements to more accurately and robustly detect arcing events. First, we construct two arcing detection datasets comprising synchronized visual and force measurements. One dataset is built from data provided by the Swiss Federal Railways (SBB), and the other is derived from publicly available videos of arcing events in different railway systems and synthetic force data that mimic the characteristics observed in the real dataset. Leveraging these datasets, we propose MultiDeepSAD, an extension of the DeepSAD algorithm for multiple modalities with a new loss formulation. Additionally, we introduce tailored pseudo-anomaly generation techniques specific to each data type, such as synthetic arc-like artifacts in images and simulated force irregularities, to augment training data and improve the discriminative ability of the model. Through extensive experiments and ablation studies, we demonstrate that our framework significantly outperforms baseline approaches, exhibiting enhanced sensitivity to real arcing events even under domain shifts and limited availability of real arcing observations.

24. 【2602.08775】VedicTHG: Symbolic Vedic Computation for Low-Resource Talking-Head Generation in Educational Avatars

Link: https://arxiv.org/abs/2602.08775

Authors: Vineet Kumar Rakesh,Ahana Bhattacharjee,Soumya Mazumdar,Tapas Samanta,Hemendra Kumar Pandey,Amitabha Das,Sarbajit Pal

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)

Keywords: improved engagement, increasingly adopted, technology to deliver, deliver content, content with social

Abstract:Talking-head avatars are increasingly adopted in educational technology to deliver content with social presence and improved engagement. However, many recent talking-head generation (THG) methods rely on GPU-centric neural rendering, large training sets, or high-capacity diffusion models, which limits deployment in offline or resource-constrained learning environments. A deterministic and CPU-oriented THG framework is described, termed Symbolic Vedic Computation, that converts speech to a time-aligned phoneme stream, maps phonemes to a compact viseme inventory, and produces smooth viseme trajectories through symbolic coarticulation inspired by Vedic sutra Urdhva Tiryakbhyam. A lightweight 2D renderer performs region-of-interest (ROI) warping and mouth compositing with stabilization to support real-time synthesis on commodity CPUs. Experiments report synchronization accuracy, temporal stability, and identity consistency under CPU-only execution, alongside benchmarking against representative CPU-feasible baselines. Results indicate that acceptable lip-sync quality can be achieved while substantially reducing computational load and latency, supporting practical educational avatars on low-end hardware. GitHub: this https URL

25. 【2602.08753】MVAnimate: Enhancing Character Animation with Multi-View Optimization

Link: https://arxiv.org/abs/2602.08753

Authors: Tianyu Sun,Zhoujie Fu,Bang Zhang,Guosheng Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demand for realistic, realistic and versatile, wide-ranging applications, versatile character animation, multi-view prior information

Comments:

Click to view abstract

Abstract:The demand for realistic and versatile character animation has surged, driven by its wide-ranging applications in various domains. However, the animation generation algorithms modeling human pose with 2D or 3D structures all face various problems, including low-quality output content and training data deficiency, preventing the related algorithms from generating high-quality animation videos. Therefore, we introduce MVAnimate, a novel framework that synthesizes both 2D and 3D information of dynamic figures based on multi-view prior information, to enhance the generated video quality. Our approach leverages multi-view prior information to produce temporally consistent and spatially coherent animation outputs, demonstrating improvements over existing animation methods. Our MVAnimate also optimizes the multi-view videos of the target character, enhancing the video quality from different views. Experimental results on diverse datasets highlight the robustness of our method in handling various motion patterns and appearances.

26. 【2602.08749】Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Link: https://arxiv.org/abs/2602.08749

Authors: Carmine Zaccagnino,Fabio Quattrini,Enis Simsar,Marta Tintoré Gazulla,Rita Cucchiara,Alessio Tonioni,Silvia Cascianelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Flow matching models, offering faster inference, Flow matching, text-guided image generation, alternative to diffusion

Comments:

Click to view abstract

Abstract:Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.

27. 【2602.08735】From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Link: https://arxiv.org/abs/2602.08735

Authors: Masanari Oi,Koki Maeda,Ryuto Koike,Daisuke Oba,Nakamasa Inoue,Naoaki Okazaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, made substantial progress, remains challenging, large language models, multi-image spatial reasoning

Comments:

Click to view abstract

Abstract:While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

28. 【2602.08730】Closing the Confusion Loop: CLIP-Guided Alignment for Source-Free Domain Adaptation

Link: https://arxiv.org/abs/2602.08730

Authors: Shanshan Wang,Ziying Feng,Xiaozheng Shen,Xun Yang,Pichao Wang,Zhenwei He,Xingyi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pre-trained source model, data security, source model, tackles the problem, unlabeled target domain

Comments:

Click to view abstract

Abstract:Source-Free Domain Adaptation (SFDA) tackles the problem of adapting a pre-trained source model to an unlabeled target domain without accessing any source data, which makes it well suited to settings where data security matters. Although recent advances have shown that pseudo-labeling strategies can be effective, they often fail in fine-grained scenarios due to subtle inter-class similarities. A critical but underexplored issue is the presence of asymmetric and dynamic class confusion, where visually similar classes are unequally and inconsistently misclassified by the source model. Existing methods typically ignore such confusion patterns, leading to noisy pseudo-labels and poor target discrimination. To address this, we propose CLIP-Guided Alignment (CGA), a novel framework that explicitly models and mitigates class confusion in SFDA. Our method consists of three parts: (1) MCA, which first detects directional confusion pairs by analyzing the predictions of the source model in the target domain; (2) MCC, which leverages CLIP to construct confusion-aware textual prompts (e.g., "a truck that looks like a bus"), enabling more context-sensitive pseudo-labeling; and (3) FAM, which builds confusion-guided feature banks for both CLIP and the source model and aligns them using contrastive learning to reduce ambiguity in the representation space. Extensive experiments on various datasets demonstrate that CGA consistently outperforms state-of-the-art SFDA methods, with especially notable gains in confusion-prone and fine-grained scenarios. Our results highlight the importance of explicitly modeling inter-class confusion for effective source-free adaptation. Our code can be found at this https URL

29. 【2602.08727】Artifact Reduction in Undersampled 3D Cone-Beam CTs using a Hybrid 2D-3D CNN Framework

Link: https://arxiv.org/abs/2602.08727

Authors: Johannes Thalhammer,Tina Dorosti,Sebastian Peterhansl,Daniela Pfeiffer,Franz Pfeiffer,Florian Schaff

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: minimize acquisition time, introduce artifacts degrading, volumes minimize acquisition, degrading image quality, diagnostic utility

Comments:

Click to view abstract

Abstract:Undersampled CT volumes minimize acquisition time and radiation exposure but introduce artifacts that degrade image quality and diagnostic utility. Reducing these artifacts is critical for high-quality imaging. We propose a computationally efficient hybrid deep-learning framework that combines the strengths of 2D and 3D models. First, a 2D U-Net operates on individual slices of undersampled CT volumes to extract feature maps. These slice-wise feature maps are then stacked across the volume and used as input to a 3D decoder, which utilizes contextual information across slices to predict an artifact-free 3D CT volume. The proposed two-stage approach balances the computational efficiency of 2D processing with the volumetric consistency provided by 3D modeling. The results show substantial improvements in inter-slice consistency in the coronal and sagittal directions with low computational overhead. This hybrid framework presents a robust and efficient solution for high-quality 3D CT image post-processing. The code of this project can be found on GitHub: this https URL.

30. 【2602.08726】SynSacc: A Blender-to-V2E Pipeline for Synthetic Neuromorphic Eye-Movement Data and Sim-to-Real Spiking Model Training

Link: https://arxiv.org/abs/2602.08726

Authors: Khadija Iddrisu,Waseem Shariff,Suzanne Little,Noel OConnor

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cognition and perception, fundamental to understanding, understanding the mechanisms, mechanisms of human, human cognition

Comments: Accepted to the 2nd Workshop on "Event-based Vision in the Era of Generative AI - Transforming Perception and Visual Innovation", IEEE Winter Conference on Applications of Computer Vision (WACV 2026)

Click to view abstract

Abstract:The study of eye movements, particularly saccades and fixations, is fundamental to understanding the mechanisms of human cognition and perception. Accurate classification of these movements requires sensing technologies capable of capturing rapid dynamics without distortion. Event cameras, also known as Dynamic Vision Sensors (DVS), provide asynchronous recordings of changes in light intensity, thereby eliminating motion blur inherent in conventional frame-based cameras and offering superior temporal resolution and data efficiency. In this study, we introduce a synthetic dataset generated with Blender to simulate saccades and fixations under controlled conditions. Leveraging Spiking Neural Networks (SNNs), we evaluate its robustness by training two architectures and finetuning on real event data. The proposed models achieve up to 0.83 accuracy and maintain consistent performance across varying temporal resolutions, demonstrating stability in eye movement classification. Moreover, the use of SNNs with synthetic event streams yields substantial computational efficiency gains over artificial neural network (ANN) counterparts, underscoring the utility of synthetic data augmentation in advancing event-based vision. All code and datasets associated with this work are available at https://github.com/Ikhadija-5/SynSacc-Dataset.

31. 【2602.08725】FusionEdit: Semantic Fusion and Attention Modulation for Training-Free Image Editing

Link: https://arxiv.org/abs/2602.08725

Authors: Yongwen Lai,Chaoqun Wang,Shaobo Min

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text-guided image editing, Text-guided image, modify specific regions, aims to modify, modify specific

Comments: Accepted by ICASSP 2026

Click to view abstract

Abstract:Text-guided image editing aims to modify specific regions according to the target prompt while preserving the identity of the source image. Recent methods exploit explicit binary masks to constrain editing, but hard mask boundaries introduce artifacts and reduce editability. To address these issues, we propose FusionEdit, a training-free image editing framework that achieves precise and controllable edits. First, editing and preserved regions are automatically identified by measuring semantic discrepancies between source and target prompts. To mitigate boundary artifacts, FusionEdit performs distance-aware latent fusion along region boundaries to yield a soft and accurate mask, and employs a total variation loss to enforce smooth transitions, obtaining natural editing results. Second, FusionEdit leverages AdaIN-based modulation within DiT attention layers to perform a statistical attention fusion in the editing region, enhancing editability while preserving global consistency with the source image. Extensive experiments demonstrate that FusionEdit significantly outperforms state-of-the-art methods. Code is available at this https URL.
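The abstract mentions AdaIN-based modulation without giving its exact form. As a rough illustration only, the sketch below applies standard adaptive instance normalization to a single feature channel: the content values are re-normalized so that their mean and standard deviation match those of a reference signal. The function name `adain`, the `eps` term, and the toy inputs are assumptions for illustration; how FusionEdit wires this into DiT attention layers is not specified in the abstract and is not modeled here.

```python
import statistics


def adain(content, style, eps=1e-5):
    """Standard AdaIN on one channel (illustrative, not the paper's code).

    Shifts and scales `content` so its mean/std match those of `style`.
    `eps` guards against division by zero for constant inputs.
    """
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]


# Hypothetical toy features.
content = [1.0, 2.0, 3.0, 4.0]
style = [10.0, 20.0, 30.0, 40.0]
out = adain(content, style)
print([round(v, 2) for v in out])
```

After the call, `out` keeps the relative ordering of `content` while taking on the first- and second-order statistics of `style`, which is the sense in which AdaIN performs a "statistical" fusion.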

32. 【2602.08724】Rotated Lights for Consistent and Efficient 2D Gaussians Inverse Rendering

Link: https://arxiv.org/abs/2602.08724

Authors: Geng Lin,Matthias Zwicker

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Inverse rendering aims, Inverse rendering, aims to decompose, inverse rendering methods, rendering model

Comments: Project Page: [this https URL](https://rotlight-ir.github.io/)

Click to view abstract

Abstract:Inverse rendering aims to decompose a scene into its geometry, material properties and light conditions under a certain rendering model. It has wide applications such as view synthesis, relighting, and scene editing. In recent years, inverse rendering methods have been inspired by view synthesis approaches like neural radiance fields and Gaussian splatting, which are capable of efficiently decomposing a scene into its geometry and radiance. They then further estimate the material and lighting that lead to the observed scene radiance. However, the latter step is highly ambiguous, and prior works suffer from inaccurate color and baked shadows in their albedo estimation despite their regularization. To this end, we propose RotLight, a simple capturing setup, to address the ambiguity. Compared to a usual capture, RotLight only requires the object to be rotated several times during the process. We show that as few as two rotations are effective in reducing artifacts. To further improve 2DGS-based inverse rendering, we additionally introduce a proxy mesh that not only allows accurate incident light tracing, but also enables a residual constraint and improves global illumination handling. We demonstrate with both synthetic and real-world datasets that our method achieves superior albedo estimation while keeping computation efficient.

33. 【2602.08717】Zero-shot System for Automatic Body Region Detection for Volumetric CT and MR Images

Link: https://arxiv.org/abs/2602.08717

Authors: Farnaz Khun Jush,Grit Werner,Mark Klemens,Matthias Lenga

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: unreliable DICOM metadata, medical imaging workflows, automated medical imaging, remain heavily dependent, existing solutions remain

Comments: 8 pages, 5 figures, 5 tables

Click to view abstract

Abstract:Reliable identification of anatomical body regions is a prerequisite for many automated medical imaging workflows, yet existing solutions remain heavily dependent on unreliable DICOM metadata. Current solutions mainly use supervised learning, which limits their applicability in many real-world scenarios. In this work, we investigate whether body region detection in volumetric CT and MR images can be achieved in a fully zero-shot manner by using knowledge embedded in large pre-trained foundation models. We propose and systematically evaluate three training-free pipelines: (1) a segmentation-driven rule-based system leveraging pre-trained multi-organ segmentation models, (2) a Multimodal Large Language Model (MLLM) guided by radiologist-defined rules, and (3) a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. All methods are evaluated on 887 heterogeneous CT and MR scans with manually verified anatomical region labels. The segmentation-driven rule-based approach achieves the strongest and most consistent performance, with weighted F1-scores of 0.947 (CT) and 0.914 (MR), demonstrating robustness across modalities and atypical scan coverage. The MLLM performs competitively in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations.
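For readers unfamiliar with the reported metric, the sketch below computes a weighted F1-score under the common definition: per-class F1 averaged with weights proportional to each class's number of true samples. That the paper uses exactly this definition is an assumption, and the region labels and predictions are made-up toy data, not results from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (illustrative sketch)."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is >= 1 because n >= 1.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)


# Hypothetical body-region labels.
y_true = ["head", "head", "chest", "chest", "chest", "abdomen"]
y_pred = ["head", "chest", "chest", "chest", "chest", "abdomen"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by support means frequent regions dominate the score, which is worth keeping in mind when comparing the CT and MR figures across datasets with different class balance.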

34. 【2602.08713】Towards Understanding Multimodal Fine-Tuning: Spatial Features

Link: https://arxiv.org/abs/2602.08713

Authors: Lachin Naghashyar,Hunar Batra,Ashkan Khakzar,Philip Torr,Ronald Clark,Christian Schroeder de Witt,Constantin Venhoff

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Contemporary Vision-Language Models, achieve strong performance, Contemporary Vision-Language, achieve strong, fine-tuned for visual-text

Comments:

Click to view abstract

Abstract:Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

35. 【2602.08711】TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Link: https://arxiv.org/abs/2602.08711

Authors: Linli Yao,Yuancheng Wei,Yaojie Zhang,Lei Li,Xinlong Chen,Feifan Song,Ziyue Wang,Kun Ouyang,Yuanxin Liu,Lingpeng Kong,Qi Liu,Pengfei Wan,Kun Gai,Yuanxing Zhang,Xu Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Omni Dense Captioning, paper proposes Omni, proposes Omni Dense, structured audio-visual narratives, Dense Captioning

Comments:

Click to view abstract

Abstract:This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at this https URL.

36. 【2602.08699】Low-Light Video Enhancement with An Effective Spatial-Temporal Decomposition Paradigm

Link: https://arxiv.org/abs/2602.08699

Authors: Xiaogang Xu,Kun Zhou,Tao Hu,Jiafei Wu,Ruixing Wang,Hao Peng,Bei Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-Light Video Enhancement, View-aware Low-light Video, seeks to restore, invisibility and noise, static scenes plagued

Comments:

Click to view abstract

Abstract:Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. The framework is called View-aware Low-light Video Enhancement (VLLVE). We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Building upon VLLVE, we propose a more comprehensive decomposition strategy by introducing an additive residual term, resulting in VLLVE++. This residual term can simulate scene-adaptive degradations, which are difficult to model using a decomposition formulation for common scenes, thereby further enhancing the ability to capture the overall content of videos. In addition, VLLVE++ enables bidirectional learning for both enhancement and degradation-aware correspondence refinement (end-to-end manner), effectively increasing reliable correspondences while filtering out incorrect ones. Notably, VLLVE++ demonstrates strong capability in handling challenging cases, such as real-world scenes and videos with high dynamics. Extensive experiments are conducted on widely recognized LLVE benchmarks.

37. 【2602.08683】OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Link: https://arxiv.org/abs/2602.08683

Authors: Feilong Tang,Xiang An,Yunyao Yan,Yin Xie,Bin Qin,Kaicheng Yang,Yifei Shen,Yuanhan Zhang,Chunyuan Li,Shikun Feng,Changrui Chen,Huajie Tan,Ming Hu,Manyuan Zhang,Bo Li,Ziyong Feng,Ziwei Liu,Zongyuan Ge,Jiankang Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual, Artificial general intelligence, video

Comments:

Click to view abstract

Abstract:Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., Codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.

38. 【2602.08682】ALIVE: Animate Your World with Lifelike Audio-Video Generation

Link: https://arxiv.org/abs/2602.08682

Authors: Ying Guo,Qijun Gan,Yifu Zhang,Jinlai Liu,Yifei Hu,Pan Xie,Dongjun Qian,Yu Zhang,Ruiqi Li,Yuqi Zhang,Ruibiao Lu,Xiaofeng Mei,Bo Han,Xiang Yin,Bingyue Peng,Zehuan Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video generation, unified audio-video generation, Sora-style audio-video generation, rapidly evolving, evolving towards unified

Comments:

Click to view abstract

Abstract:Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-VideoAudio (T2VA) and Reference-to-VideoAudio (animation) capabilities that T2V foundation models lack. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: this https URL.

39. 【2602.08670】A Machine Learning accelerated geophysical fluid solver

Link: https://arxiv.org/abs/2602.08670

Authors: Yang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Performance (cs.PF); Computational Physics (physics.comp-ph)

Keywords: Machine learning methods, natural language processing, Machine learning, language processing, image classification

Comments: Master Thesis

Click to view abstract

Abstract:Machine learning methods have been successful in many areas, such as image classification and natural language processing. However, it remains unclear how to apply ML to areas with mathematical constraints, such as solving PDEs. Among various approaches to applying ML techniques to solving PDEs, the data-driven discretization method presents a promising way of accelerating and improving existing PDE solvers on structured grids, where it predicts the coefficients of quasi-linear stencils for computing values or derivatives of a function at given positions. It can improve the accuracy and stability of low-resolution simulations compared with traditional finite difference or finite volume schemes. Meanwhile, it can also benefit from traditional numerical schemes, for example by satisfying conservation laws through finite-volume-type formulations. In this thesis, we implement classic solvers for the shallow water equation and the Euler equation under a different framework. Experiments show that our classic solver performs much better than the PyClaw solver. We then propose four different deep neural networks for the ML-based solver. The results indicate that two of these approaches can produce satisfactory solutions.

40. 【2602.08661】WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling

Link: https://arxiv.org/abs/2602.08661

Authors: Yi Dao,Lankai Zhang,Hao Liu,Haiwei Zhang,Wenbo Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Internet of Things, enabling applications ranging, Human pose estimation, enabling applications, human-computer interaction

Comments:

Click to view abstract

Abstract:Human pose estimation is fundamental to intelligent perception in the Internet of Things (IoT), enabling applications ranging from smart healthcare to human-computer interaction. While WiFi-based methods have gained traction, they often struggle with continuous motion and high computational overhead. This work presents WiFlow, a novel framework for continuous human pose estimation using WiFi signals. Unlike vision-based approaches such as two-dimensional deep residual networks that treat Channel State Information (CSI) as images, WiFlow employs an encoder-decoder architecture. The encoder captures spatio-temporal features of CSI using temporal and asymmetric convolutions, preserving the original sequential structure of signals. It then refines keypoint features of the human bodies to be tracked and captures their structural dependencies via axial attention. The decoder subsequently maps the encoded high-dimensional features into keypoint coordinates. Trained on a self-collected dataset of 360,000 synchronized CSI-pose samples from 5 subjects performing continuous sequences of 8 daily activities, WiFlow achieves a Percentage of Correct Keypoints (PCK) of 97.00% at a threshold of 20% (PCK@20) and 99.48% at PCK@50, with a mean per-joint position error of 0.008m. With only 4.82M parameters, WiFlow significantly reduces model complexity and computational cost, establishing a new performance baseline for practical WiFi-based human pose estimation. Our code and datasets are available at this https URL.
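To make the PCK@20 and PCK@50 figures concrete, the sketch below computes Percentage of Correct Keypoints in its usual form: a keypoint counts as correct when its Euclidean error is at most a fraction `alpha` of some reference length. The abstract does not say which reference length WiFlow normalizes by, so `norm_length` here is an assumed parameter, and the coordinates are invented toy values.

```python
import math


def pck(pred, gt, norm_length, alpha):
    """Fraction of keypoints with error <= alpha * norm_length (sketch).

    `norm_length` is an assumed reference size (e.g. a torso or bounding-box
    dimension); the paper's actual normalization is not given in the abstract.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    threshold = alpha * norm_length
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(pred)


# Hypothetical 2D keypoints; errors are 0.05, 0.3, 0.0 and 0.6.
gt = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pred = [(0.05, 0.0), (1.0, 0.3), (1.0, 1.0), (0.0, 1.6)]
print(pck(pred, gt, norm_length=1.0, alpha=0.2))  # 0.5
print(pck(pred, gt, norm_length=1.0, alpha=0.5))  # 0.75
```

Loosening the threshold from 20% to 50% can only keep or raise the score, which is why PCK@50 is always at least PCK@20 for the same predictions.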

41. 【2602.08652】Deep Learning-Based Fixation Type Prediction for Quality Assurance in Digital Pathology

Link: https://arxiv.org/abs/2602.08652

Authors: Oskar Thaeter,Tanja Niedermair,Johannes Raffler,Ralf Huss,Peter J. Schüffler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate annotation

Comments: 17 pages, 8 figures, 7 tables

Click to view abstract

Abstract:Accurate annotation of fixation type is a critical step in slide preparation for pathology laboratories. However, this manual process is prone to errors, impacting downstream analyses and diagnostic accuracy. Existing methods for verifying formalin-fixed, paraffin-embedded (FFPE), and frozen section (FS) fixation types typically require full-resolution whole-slide images (WSIs), limiting scalability for high-throughput quality control. We propose a deep-learning model to predict fixation types using low-resolution, pre-scan thumbnail images. The model was trained on WSIs from the TUM Institute of Pathology (n=1,200, Leica GT450DX) and evaluated on a class-balanced subset of The Cancer Genome Atlas dataset (TCGA, n=8,800, Leica AT2), as well as on class-balanced datasets from Augsburg (n=695 [392 FFPE, 303 FS], Philips UFS) and Regensburg (n=202, 3DHISTECH P1000). Our model achieves an AUROC of 0.88 on TCGA, outperforming comparable pre-scan methods by 4.8%. It also achieves AUROCs of 0.72 on Regensburg and Augsburg slides, underscoring challenges related to scanner-induced domain shifts. Furthermore, the model processes each slide in 21 ms, $400\times$ faster than existing high-magnification, full-resolution methods, enabling rapid, high-throughput processing. This approach provides an efficient solution for detecting labelling errors without relying on high-magnification scans, offering a valuable tool for quality control in high-throughput pathology workflows. Future work will improve and evaluate the model's generalisation to additional scanner types. Our findings suggest that this method can increase accuracy and efficiency in digital pathology workflows and may be extended to other low-resolution slide annotations.

Submission history From: Oskar Thaeter [view email] [v1]
Mon, 9 Feb 2026 13:46:55 UTC (7,839 KB)


42. 【2602.08632】We Should Separate Memorization from Copyright

Link: https://arxiv.org/abs/2602.08632

Authors: Adi Haviv, Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: models has introduced, risk factor, foundation models, copyright

Comments:

Click to view abstract

Abstract:The widespread use of foundation models has introduced a new risk factor: copyright issues. This is fueling an active, ongoing debate within the data-science community and among legal scholars, in which claims and results on both sides are often interpreted differently and lead to different implications. Our position is that much of the technical literature relies on traditional reconstruction techniques that are not designed for copyright analysis. As a result, memorization and copying have been conflated across both technical and legal communities and in multiple contexts. We argue that memorization, as commonly studied in data science, should not be equated with copying and should not be used as a proxy for copyright infringement. We distinguish technical signals that meaningfully indicate infringement risk from those that instead reflect lawful generalization or high-frequency content. Based on this analysis, we advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards and provides a more principled foundation for research, auditing, and policy.

43. 【2602.08626】Revisiting [CLS] and Patch Token Interaction in Vision Transformers

Link: https://arxiv.org/abs/2602.08626

Authors: Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, Huy V. Vo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, Transformers have emerged, versatile representation learners, emerged as powerful, scalable and versatile

Comments: To be published as a conference paper at ICLR 2026

Click to view abstract

Abstract:Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model. In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens. Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections. This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead. Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.

44. 【2602.08620】Improving Reconstruction of Representation Autoencoder

Link: https://arxiv.org/abs/2602.08620

Authors: Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, Chongyi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, leverages Vision Foundation, Recent work leverages, work leverages Vision, Vision Foundation

Comments:

Click to view abstract

Abstract:Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack low-level information (e.g., color and texture), leading to degraded reconstruction fidelity, which has emerged as a primary bottleneck in further scaling LDMs. To address this limitation, we propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information, enabling high-fidelity reconstruction while remaining highly aligned with the semantic distribution. We further observe that the resulting high-dimensional, information-rich latents make decoders sensitive to latent perturbations, causing severe artifacts when decoding generated latents and consequently degrading generation quality. Our analysis suggests that this sensitivity primarily stems from excessive decoder responses along directions off the data manifold. Building on these insights, we propose fine-tuning the decoder to increase its robustness and smoothing the generated latents via controlled noise injection, thereby enhancing generation quality. Experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction and achieving strong generative quality. Our code is available at this https URL.

45. 【2602.08615】Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration

Link: https://arxiv.org/abs/2602.08615

Authors: Kfir Goldberg, Elad Richardson, Yael Vinker

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: executing carefully crafted, carefully crafted textual, precedes idea formation, crafted textual prompts, offering limited support

Comments: Project page available at [this https URL](https://inspirationseedspaper.github.io/InspirationSeeds/)

Click to view abstract

Abstract:While generative models have become powerful tools for image synthesis, they are typically optimized for executing carefully crafted textual prompts, offering limited support for the open-ended visual exploration that often precedes idea formation. In contrast, designers frequently draw inspiration from loosely connected visual references, seeking emergent connections that spark new ideas. We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. Given two input images, our model produces diverse, visually coherent compositions that reveal latent relationships between inputs, without relying on user-specified text prompts. Our approach is feed-forward, trained on synthetic triplets of decomposed visual aspects derived entirely through visual means: we use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs. By removing the reliance on language and enabling fast, intuitive recombination, our method supports visual ideation at the early and ambiguous stages of creative work.

46. 【2602.08613】Overview and Comparison of AVS Point Cloud Compression Standard

Link: https://arxiv.org/abs/2602.08613

Authors: Wei Gao, Wenxu Gao, Xingming Mu, Changhao Peng, Ge Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point cloud compression, digital heritage protection, Point cloud, cloud compression, data representation format

Comments: 3 figures, 3 tables

Click to view abstract

Abstract:The point cloud is a prevalent 3D data representation format with significant application value in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges to transmission and storage, which hinders wide deployment. Therefore, point cloud compression plays a crucial role in practical applications for both human and machine perception optimization. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression: Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, namely AVS PCC. This new standardization effort has adopted many new coding tools and techniques that differ from those of its counterpart standards. This paper reviews the AVS PCC standard from two perspectives, i.e., the related technologies and performance comparisons.

47. 【2602.08582】SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning

Link: https://arxiv.org/abs/2602.08582

Authors: Melany Yang, Yuhang Yu, Diwang Weng, Jinwei Chen, Wei Dong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Photorealistic color retouching, visual content creation, manual retouching remains, retouching remains inaccessible, Photorealistic color

Comments:

Click to view abstract

Abstract:Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at this https URL.

48. 【2602.08558】FLAG-4D: Flow-Guided Local-Global Dual-Deformation Model for 4D Reconstruction

Link: https://arxiv.org/abs/2602.08558

Authors: Guan Yuan Tan, Ngoc Tuan Vu, Arghya Pal, Sailaja Rajanala, Raphael Phan C.-W., Mettu Srinivas, Chee-Ming Ting

Categories: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)

Keywords: Gaussian primitives evolve, single Multilayer Perceptron, framework for generating, scenes by reconstructing, primitives evolve

Comments:

Click to view abstract

Abstract:We introduce FLAG-4D, a novel framework for generating novel views of dynamic scenes by reconstructing how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single Multilayer Perceptron (MLP) to model temporal deformations, and they often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. Our approach, FLAG-4D, overcomes this by employing a dual-deformation network that dynamically warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. This dual-deformation network consists of an Instantaneous Deformation Network (IDN) for modeling fine-grained, local deformations and a Global Motion Network (GMN) for capturing long-range dynamics, refined through mutual learning. To ensure these deformations are both accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical flow backbone. We fuse these motion cues from adjacent timeframes and use a deformation-guided attention mechanism to align this flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity and more temporally coherent reconstructions with finer detail preservation than state-of-the-art methods.

49. 【2602.08550】GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Link: https://arxiv.org/abs/2602.08550

Authors: Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Keywords: Human perception, video stream arises, effective object tracking, generic object tracking, knowledge combined

Comments: ICLR 2026. This is a preprint version. The camera-ready version will be updated soon

Click to view abstract

Abstract:Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to enable geometric cue inference from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.

50. 【2602.08540】TIBR4D: Tracing-Guided Iterative Boundary Refinement for Efficient 4D Gaussian Segmentation

Link: https://arxiv.org/abs/2602.08540

Authors: He Wu, Xia Yan, Yanghui Xu, Liegang Xia, Jiazhou Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: scenes remains challenging, remains challenging due, Gaussian scenes remains, Object-level segmentation, Gaussian Instance Tracing

Comments: 13 pages, 6 figures, 4 tables

Click to view abstract

Abstract:Object-level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning-free 4D Gaussian segmentation framework that lifts video segmentation masks to 4D spaces, whose core is a two-stage iterative boundary refinement, TIBR4D. The first stage is an Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level. It progressively refines Gaussian-to-instance probabilities through iterative tracing, and extracts corresponding Gaussian point clouds that better handle occlusions and preserve completeness of object structures compared to existing one-shot threshold-based methods. The second stage is a frame-wise Gaussian Rendering Range Control (RCC) via suppressing highly uncertain Gaussians near object boundaries while retaining their core contributions for more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness. Longer segments enforce stronger multi-frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency compared to SOTA methods.

51. 【2602.08531】Thegra: Graph-based SLAM for Thermal Imagery

Link: https://arxiv.org/abs/2602.08531

Authors: Anastasiia Kornilova, Ivan Moskalenko, Arabella Gromova, Gonzalo Ferrer, Alexander Menshchikov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: practical sensing modality, visually degraded environments, adverse weather, practical sensing, sensing modality

Comments:

Click to view abstract

Abstract:Thermal imaging provides a practical sensing modality for visual SLAM in visually degraded environments such as low illumination, smoke, or adverse weather. However, thermal imagery often exhibits low texture, low contrast, and high noise, complicating feature-based SLAM. In this work, we propose a sparse monocular graph-based SLAM system for thermal imagery that leverages general-purpose learned features -- the SuperPoint detector and LightGlue matcher, trained on large-scale visible-spectrum data to improve cross-domain generalization. To adapt these components to thermal data, we introduce a preprocessing pipeline to enhance input suitability and modify core SLAM modules to handle sparse and outlier-prone feature matches. We further incorporate keypoint confidence scores from SuperPoint into a confidence-weighted factor graph to improve estimation robustness. Evaluations on public thermal datasets demonstrate that the proposed system achieves reliable performance without requiring dataset-specific training or fine-tuning a desired feature detector, given the scarcity of quality thermal data. Code will be made available upon publication.

52. 【2602.08528】Automatic regularization parameter choice for tomography using a double model approach

Link: https://arxiv.org/abs/2602.08528

Authors: Chuyang Wu, Samuli Siltanen

Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Keywords: ill-posed inverse problem, X-ray tomography, Image reconstruction, ill-posed inverse, inverse problem

Comments:

Click to view abstract

Abstract:Image reconstruction in X-ray tomography is an ill-posed inverse problem, particularly with limited available data. Regularization is thus essential, but its effectiveness hinges on the choice of a regularization parameter that balances data fidelity against a priori information. We present a novel method for automatic parameter selection based on the use of two distinct computational discretizations of the same problem. A feedback control algorithm dynamically adjusts the regularization strength, driving an iterative reconstruction toward the smallest parameter that yields sufficient similarity between reconstructions on the two grids. The effectiveness of the proposed approach is demonstrated using real tomographic data.
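The feedback idea in this abstract can be illustrated with a toy controller. The sketch below is not the authors' algorithm: the abstract does not specify the control law or the similarity measure, so the proportional update in log space, the function names, and the stand-in `toy_discrepancy` (which replaces the mismatch between reconstructions on two grids) are all assumptions. It only shows how a feedback loop can settle on the smallest regularization parameter whose discrepancy meets a tolerance, assuming the discrepancy shrinks as the parameter grows.

```python
import math


def tune_regularization(discrepancy, tolerance, alpha0=1.0, gain=0.5, steps=100):
    """Drive alpha toward the point where discrepancy(alpha) equals tolerance.

    discrepancy: callable returning a non-negative mismatch for a given alpha
                 (hypothetical stand-in for a two-grid reconstruction mismatch).
    tolerance:   the mismatch level regarded as "sufficiently similar".
    """
    alpha = alpha0
    for _ in range(steps):
        relative_error = (discrepancy(alpha) - tolerance) / tolerance
        # Proportional control in log space keeps alpha positive:
        # too much mismatch raises alpha, too little lowers it.
        alpha *= math.exp(gain * relative_error)
    return alpha


def toy_discrepancy(alpha):
    """Toy model: the mismatch decays monotonically as regularization grows."""
    return 1.0 / (1.0 + alpha)


# With tolerance 0.1 the toy mismatch equals the tolerance at alpha = 9.
alpha_star = tune_regularization(toy_discrepancy, tolerance=0.1)
```

In a real setting each call to `discrepancy` would involve reconstruction work on both discretizations, so the number of controller steps and the gain would need to be chosen with that cost in mind.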

53. 【2602.08524】GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving

Link: https://arxiv.org/abs/2602.08524

Authors: Linger Deng, Yuliang Liu, Wenwen Yu, Zujia Zhang, Jianzhong Ju, Zhenbo Luo, Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multimodal Models, Geometry problem-solving remains, Large Multimodal, Geometry problem-solving, challenge for Large

Comments:

Click to view abstract

Abstract:Geometry problem-solving remains a significant challenge for Large Multimodal Models (LMMs), requiring not only global shape recognition but also attention to intricate local relationships related to geometric theory. To address this, we propose GeoFocus, a novel framework comprising two core modules. 1) Critical Local Perceptor, which automatically identifies and emphasizes critical local structure (e.g., angles, parallel lines, comparative distances) through thirteen theory-based perception templates, boosting critical local feature coverage by 61% compared to previous methods. 2) VertexLang, a compact topology formal language, encodes global figures through vertex coordinates and connectivity relations. By replacing bulky code-based encodings, VertexLang reduces global perception training time by 20% while improving topology recognition accuracy. When evaluated in Geo3K, GeoQA, and FormalGeo7K, GeoFocus achieves a 4.7% accuracy improvement over leading specialized models and demonstrates superior robustness in MATHVERSE under diverse visual conditions. Project Page -- this https URL

54. 【2602.08505】Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?

Link: https://arxiv.org/abs/2602.08505

Authors: Caterina Fuster-Barceló, Virginie Uhlmann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: support effective transfer, biomedical image analysis, vision foundation models, vision foundation, increasingly reused

Comments:

Click to view abstract

Abstract:Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fréchet Dinov2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
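Segmentation quality in this abstract is quantified as foreground Intersection-over-Union. A minimal sketch of that metric for binary masks follows; the function name, the nested-list mask format, and the convention for two empty masks are assumptions for illustration, not details taken from the paper.

```python
def foreground_iou(pred_mask, true_mask):
    """IoU of the foreground class for two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 2x3 masks: 2 shared foreground pixels, 4 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 0]]
true = [[0, 1, 0],
        [1, 1, 0]]
iou = foreground_iou(pred, true)
```

Because background pixels never enter the count, this score is not inflated by large empty regions, which matters when the foreground class (here, mitochondria) covers a small part of the image.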

55. 【2602.08503】Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Link: https://arxiv.org/abs/2602.08503

Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: vision-language models, solving complex reasoning, complex reasoning problems, essential for solving, solving complex

Comments: 17 pages

Click to view abstract

Abstract:Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.

56. 【2602.08491】Enhanced Food Category Recognition under Illumination-Induced Domain Shift

Link: https://arxiv.org/abs/2602.08491

Authors: Keonvin Park, Aditya Pal, Jin Hong Mok

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remains poorly understood, settings remains poorly, video settings remains, achieved strong performance, benchmark datasets

Comments:

Click to view abstract

Abstract:Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. In this work, we analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image-level food segmentation data and evaluated on video sequences using an instance segmentation with tracking-by-matching framework, enabling object-level temporal analysis. Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. These failures are largely overlooked by conventional image-based metrics, which substantially overestimate real-world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post-hoc temporal regularization and self-supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image-centric training objectives that ignore temporal coherence, rather than model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally-aware learning and evaluation protocols for video-based food analysis.

57. 【2602.08479】Gesture Matters: Pedestrian Gesture Recognition for AVs Through Skeleton Pose Evaluation

Link: https://arxiv.org/abs/2602.08479

Authors: Alif Rizqullah Mahdi, Mahdi Rezaei, Natasha Merat

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: formal traffic rules, interactions when formal, key component, component of non-verbal, non-verbal communication

Comments: 9th International Conference on Instrumentation, Control, and Automation (ICA)

Click to view abstract

Abstract:Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.

58. 【2602.08466】Reliability-aware Execution Gating for Near-field and Off-axis Vision-guided Robotic Alignment

Link: https://arxiv.org/abs/2602.08466

Authors: Ning Hu, Senhao Cao, Maochen Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: require reliable execution, increasingly deployed, deployed in precision, require reliable, Vision-guided robotic systems

Comments: 7 pages, 1 figure

Click to view abstract

Abstract:Vision-guided robotic systems are increasingly deployed in precision alignment tasks that require reliable execution under near-field and off-axis configurations. While recent advances in pose estimation have significantly improved numerical accuracy, practical robotic systems still suffer from frequent execution failures even when pose estimates appear accurate. This gap suggests that pose accuracy alone is insufficient to guarantee execution-level reliability. In this paper, we reveal that such failures arise from a deterministic geometric error amplification mechanism, in which small pose estimation errors are magnified through system structure and motion execution, leading to unstable or failed alignment. Rather than modifying pose estimation algorithms, we propose a Reliability-aware Execution Gating mechanism that operates at the execution level. The proposed approach evaluates geometric consistency and configuration risk before execution, and selectively rejects or scales high-risk pose updates. We validate the proposed method on a real UR5 robotic platform performing single-step visual alignment tasks under varying camera-target distances and off-axis configurations. Experimental results demonstrate that the proposed execution gating significantly improves task success rates, reduces execution variance, and suppresses tail-risk behavior, while leaving average pose accuracy largely unchanged. Importantly, the proposed mechanism is estimator-agnostic and can be readily integrated with both classical geometry-based and learning-based pose estimation pipelines. These results highlight the importance of execution-level reliability modeling and provide a practical solution for improving robustness in near-field vision-guided robotic systems.
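The gating step described above, which rejects or scales high-risk pose updates before execution, might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's mechanism: the scalar `risk` in [0, 1], the two thresholds, the linear shrink rule, and all names are hypothetical, and the abstract does not say how geometric consistency or configuration risk is actually scored.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    action: str    # "accept", "scale", or "reject"
    update: tuple  # pose update to execute (all zeros when rejected)


def gate_pose_update(update, risk, scale_threshold=0.3, reject_threshold=0.7):
    """Accept, shrink, or reject a pose update given a scalar risk in [0, 1].

    Both thresholds are hypothetical tuning parameters for this sketch.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    if risk >= reject_threshold:
        return GateDecision("reject", tuple(0.0 for _ in update))
    if risk >= scale_threshold:
        # Shrink linearly: full step at scale_threshold, zero step at reject_threshold.
        factor = (reject_threshold - risk) / (reject_threshold - scale_threshold)
        return GateDecision("scale", tuple(factor * u for u in update))
    return GateDecision("accept", tuple(update))


# Hypothetical 2-DoF update evaluated at three risk levels.
low = gate_pose_update((1.0, 2.0), risk=0.1)
mid = gate_pose_update((1.0, 2.0), risk=0.5)
high = gate_pose_update((1.0, 2.0), risk=0.9)
```

Since the gate only sees the proposed update and a risk score, it can wrap any pose estimator, which matches the abstract's claim that the mechanism is estimator-agnostic.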

59. 【2602.08462】TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

Link: https://arxiv.org/abs/2602.08462

Authors: Yiyang Cao,Yunze Deng,Ziyu Lin,Bin Feng,Xinggang Wang,Wenyu Liu,Dandan Zheng,Jingdong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rapidly evolving field, computer vision, aims to produce, rapidly evolving, evolving field

Abstract:Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation. Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: this https URL.

60. 【2602.08448】Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries

Link: https://arxiv.org/abs/2602.08448

Authors: Haocheng Lu,Nan Zhang,Wei Tao,Xiaoyang Qu,Guokuan Li,Jiguang Wan,Jianzong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: arbitrary time points, poses distinct challenges, multimodal large language, video question answering, frames arrive sequentially

Comments: Accepted to AAAI 2026 (Main Technical Track)

Abstract:Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary time points. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) scene-aware segmentation, where Vista dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, where each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while full-resolution frames are offloaded to CPU memory; and (3) scene-aware recall, where relevant scenes are selectively recalled and reintegrated into the model input upon receiving a query, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
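The segment-then-recall flow described in this abstract can be sketched in a few lines. This is a toy illustration only: the similarity threshold, the use of a scene's mean feature as its compact summary, and the names `segment_scenes` and `recall_scenes` are assumptions, not details from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vec(vectors):
    """Element-wise mean, used here as a stand-in scene summary."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def segment_scenes(frame_features, threshold=0.8):
    """Group consecutive frames; open a new scene when a frame's
    similarity to the current scene's mean falls below `threshold`."""
    scenes = []
    for idx, feat in enumerate(frame_features):
        if scenes:
            current = mean_vec([frame_features[i] for i in scenes[-1]])
            if cosine(feat, current) >= threshold:
                scenes[-1].append(idx)
                continue
        scenes.append([idx])
    return scenes

def recall_scenes(scene_summaries, query, top_k=1):
    """Return indices of the `top_k` scenes most similar to the query."""
    order = sorted(range(len(scene_summaries)),
                   key=lambda i: cosine(scene_summaries[i], query),
                   reverse=True)
    return order[:top_k]
```

In a real system the summaries would be compact token representations and the query an embedded question; plain float lists keep the sketch self-contained.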

61. 【2602.08439】Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Link: https://arxiv.org/abs/2602.08439

Authors: Yuhao Dong,Shulin Tian,Shuai Liu,Shuangrui Ding,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Jiaqi Wang,Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, recent Multimodal Large, Large Language Models, Multimodal Large, Large Language

Abstract:Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.

62. 【2602.08430】Understanding and Optimizing Attention-Based Sparse Matching for Diverse Local Features

Link: https://arxiv.org/abs/2602.08430

Authors: Qiang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training attention-based sparse, attention-based sparse image, sparse image matching, image matching models, revisit the problem

Abstract:We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical design choice that has been previously overlooked, which significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause for performance difference. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model achieves or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.

63. 【2602.08426】Prism: Spectral-Aware Block-Sparse Attention

Link: https://arxiv.org/abs/2602.08426

Authors: Xinghao Wang,Pengyu Wang,Xiaoran Liu,Fangxu Liu,Jason Chu,Kai Song,Xipeng Qiu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-context LLM pre-filling, accelerating long-context LLM, LLM pre-filling, long-context LLM, identifying relevant blocks

Abstract:Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
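The low-pass claim in this abstract can be checked numerically on a simplified model. It is not the paper's proof. The sketch assumes the same vector is placed at every position in a block, so that RoPE turns one dimension pair into the unit rotations $e^{i\theta p}$, and it compares the magnitude of their mean with the closed form $\left|\sin(B\theta/2) / (B\sin(\theta/2))\right|$. The function names, the block size of 64, and the two sample frequencies are illustrative assumptions.

```python
import cmath
import math

def pooled_magnitude(theta, block_size):
    """Magnitude of the mean of e^{i*theta*p} for p = 0..block_size-1."""
    total = sum(cmath.exp(1j * theta * p) for p in range(block_size))
    return abs(total / block_size)

def closed_form(theta, block_size):
    """Same quantity via the geometric-series (Dirichlet kernel) formula."""
    half = theta / 2.0
    if abs(math.sin(half)) < 1e-12:
        return 1.0
    return abs(math.sin(block_size * half) / (block_size * math.sin(half)))

# High-frequency pair (theta = 1 rad per position): rotations cancel out.
high = pooled_magnitude(1.0, 64)
# Low-frequency pair (theta = 1e-4 rad per position): almost no attenuation.
low = pooled_magnitude(1e-4, 64)
```

Here `high` falls below 0.05 while `low` stays above 0.99, which is the attenuation pattern the abstract attributes to mean pooling in high-frequency dimensions.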

64. 【2602.08397】RealSynCol: a high-fidelity synthetic colon dataset for 3D reconstruction applications

Link: https://arxiv.org/abs/2602.08397

Authors: Chiara Lena,Davide Milesi,Alessandro Casella,Luca Carlini,Joseph C. Norton,James Martin,Bruno Scaglioni,Keith L. Obstein,Roberto De Sire,Marco Spadaccini,Cesare Hassan,Pietro Valdastri,Elena De Momi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: colonoscopy by enabling, providing a comprehensive, surfaces and lesions, unexplored areas, potential to improve

Abstract:Deep learning has the potential to improve colonoscopy by enabling 3D reconstruction of the colon, providing a comprehensive view of mucosal surfaces and lesions, and facilitating the identification of unexplored areas. However, the development of robust methods is limited by the scarcity of large-scale ground truth data. We propose RealSynCol, a highly realistic synthetic dataset designed to replicate the endoscopic environment. Colon geometries extracted from 10 CT scans were imported into a virtual environment that closely mimics intraoperative conditions and rendered with realistic vascular textures. The resulting dataset comprises 28,130 frames, paired with ground truth depth maps, optical flow, 3D meshes, and camera trajectories. A benchmark study was conducted to evaluate the available synthetic colon datasets for the tasks of depth and pose estimation. Results demonstrate that the high realism and variability of RealSynCol significantly enhance generalization performance on clinical images, proving it to be a powerful tool for developing deep learning algorithms to support endoscopic diagnosis.

65. 【2602.08395】D$^2$-VR: Degradation-Robust and Distilled Video Restoration with Synergistic Optimization Strategy

Link: https://arxiv.org/abs/2602.08395

Authors: Jianfeng Liang,Shaocheng Shen,Botao Xu,Qiang Hu,Xiaoyun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex real-world degradations, prohibitive inference latency, delivering fantastic perceptual, video restoration, delivering fantastic

Abstract:The integration of diffusion priors with temporal alignment has emerged as a transformative paradigm for video restoration, delivering fantastic perceptual quality, yet the practical deployment of such frameworks is severely constrained by prohibitive inference latency and temporal instability when confronted with complex real-world degradations. To address these limitations, we propose D$^2$-VR, a single-image diffusion-based video-restoration framework with low-step inference. To obtain precise temporal guidance under severe degradation, we first design a Degradation-Robust Flow Alignment (DRFA) module that leverages confidence-aware attention to filter unreliable motion cues. We then incorporate an adversarial distillation paradigm to compress the diffusion sampling trajectory into a rapid few-step regime. Finally, a synergistic optimization strategy is devised to harmonize perceptual quality with rigorous temporal consistency. Extensive experiments demonstrate that D$^2$-VR achieves state-of-the-art performance while accelerating the sampling process by 12$\times$.

66. 【2602.08392】BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

Link: https://arxiv.org/abs/2602.08392

Authors: Xin Wu,Zhixuan Liang,Yue Ma,Mengkang Hu,Zhiyuan Qin,Xiu Li

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, significantly advanced embodied

Comments: 38 pages, 9 figures. Project page: [bimanibench.github.io](https://bimanibench.github.io/)

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.

67. 【2602.08388】Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers

Link: https://arxiv.org/abs/2602.08388

Authors: Shuo Zhang,Wenzhuo Wu,Huayu Zhang,Jiarong Cheng,Xianghao Zang,Chao Ban,Hao Sun,Zhongjiang He,Tianwei Cao,Kongming Liang,Zhanyu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, Recent, geometric, significantly improved image, geometric editing

Abstract:Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.

68. 【2602.08355】E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Link: https://arxiv.org/abs/2602.08355

Authors: Xianjie Liu,Yiman Hu,Liang Wu,Ping Hu,Yixiong Zou,Jian Xu,Bo Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense multi-modal signals, online video industry, video industry characterized, short videos represent, represent a high-revenue

Abstract:E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended QA pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

69. 【2602.08346】What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

Link: https://arxiv.org/abs/2602.08346

Authors: Yujin Zhou,Pengcheng Wen,Jiale Chen,Boqin Yin,Han Zhu,Jiaming Ji,Juntao Dai,Chi-Min Chan,Sirui Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, demonstrated excellent abilities

Abstract:The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.

70. 【2602.08342】UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science

Link: https://arxiv.org/abs/2602.08342

Authors: Jie Zhang,Xingtong Yu,Yuan Fang,Rudi Stouffs,Zdravko Trivic

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Learning transferable multimodal, transferable multimodal embeddings, lack explicit alignment, transferable multimodal, environments is challenging

Abstract:Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks -- including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built upon Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and over 30% and 22% gains respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.

71. 【2602.08339】CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT

Link: https://arxiv.org/abs/2602.08339

Authors: Chengyi Du,Yazhe Niu,Dazhong Shen,Luxin Xu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: improved image-text alignment, markedly improved image-text, Recent advances, image-text alignment, advances in vision-language

Comments: 16 pages, 6 figures

Abstract:Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations by introducing human models into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33 percent on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.

72. 【2602.08337】Language-Guided Transformer Tokenizer for Human Motion Generation

Link: https://arxiv.org/abs/2602.08337

Authors: Sheng Yan,Yong Wang,Xin Du,Junsong Yuan,Mengyuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: process proven crucial, converts raw motion, compact discrete tokens, converts raw, process proven

Abstract:In this paper, we focus on motion discrete tokenization, which converts raw motion into compact discrete tokens--a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common approach to improving motion reconstruction quality, but more tokens make it more difficult for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based Tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming state-of-the-art methods (MARDM: 0.500 and 0.528), and with FID scores of 0.057 and 0.088, respectively, versus 0.114 and 0.147. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of our semantic representations.
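The language-drop scheme mentioned in this abstract (randomly removing the language condition during training) can be sketched as below. The function name `maybe_drop_language`, the zero-vector placeholder, and the drop probability are assumptions for illustration; the abstract does not say how the removed condition is represented.

```python
import random

def maybe_drop_language(text_embedding, drop_prob, rng, null_embedding=None):
    """With probability `drop_prob`, replace the text condition with a
    placeholder so the model also learns to run without language.

    The all-zeros placeholder is a hypothetical choice.
    """
    if rng.random() < drop_prob:
        if null_embedding is not None:
            return null_embedding
        return [0.0] * len(text_embedding)
    return text_embedding

# Example: drop the condition for roughly 10% of training samples.
rng = random.Random(0)
conditioned = maybe_drop_language([0.5, -0.2, 0.1], 0.1, rng)
```

Passing an explicit `random.Random` instance keeps the drop pattern reproducible across runs.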

73. 【2602.08336】UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

Link: https://arxiv.org/abs/2602.08336

Authors: Cheng Yang,Chufan Shi,Bo Shui,Yaokang Wu,Muzi Tao,Huijuan Wang,Ivan Yee Lee,Yong Liu,Xuezhe Ma,Taylor Berg-Kirkpatrick

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: implicit visual requirements, recent unified multimodal, models increasingly adopt, multimodal models increasingly, increasingly adopt

Comments: Project page: [ureason.github.io](https://ureason.github.io)

Abstract:To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.

74. 【2602.08309】CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

Link: https://arxiv.org/abs/2602.08309

Authors: Yunzuo Hu,Wen Li,Jing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: degraded representation quality, amplify irrelevant regions, Caption-Aligned Saliency-guided Enrichment, Audio-visual learning suffers, modality misalignment caused

Comments: 13 pages, 8 figures

Abstract:Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we proposed a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which used two complementary modules: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE) to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives, caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
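One of the objectives named in this abstract is a caption-to-modality InfoNCE loss. The sketch below shows the generic InfoNCE form only: each anchor (for example, a caption embedding) is matched to the positive with the same index, and the other rows in the batch act as negatives. The temperature, the in-batch negatives, and the function name `info_nce` are assumptions, not details from the paper.

```python
import math

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i].

    Inputs are lists of non-zero float vectors. They are L2-normalised
    so the logits are cosine similarities divided by `temperature`.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)
```

Correctly paired embeddings give a loss near zero, and mismatched pairs give a large one, which is the signal that pulls each caption toward its own audio or visual features.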

75. 【2602.08282】Tighnari v2: Mitigating Label Noise and Distribution Shift in Multimodal Plant Distribution Prediction via Mixture of Experts and Weakly Supervised Learning

Link: https://arxiv.org/abs/2602.08282

Authors: Haixu Liu,Yufei Wang,Tianxiang Xu,Chuancheng Shi,Hongsheng Xing

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: cross-species plant distribution, plant distribution prediction, distribution prediction plays, face significant challenges, cross-species plant

Abstract:Large-scale, cross-species plant distribution prediction plays a crucial role in biodiversity conservation, yet modeling efforts in this area still face significant challenges due to the sparsity and bias of observational data. Presence-Absence (PA) data provide accurate and noise-free labels, but are costly to obtain and limited in quantity; Presence-Only (PO) data, by contrast, offer broad spatial coverage and rich spatiotemporal distribution, but suffer from severe label noise in negative samples. To address these real-world constraints, this paper proposes a multimodal fusion framework that fully leverages the strengths of both PA and PO data. We introduce an innovative pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, enabling geographic alignment between the label space and remote sensing feature space. In terms of model architecture, we adopt Swin Transformer Base as the backbone for satellite imagery, utilize the TabM network for tabular feature extraction, retain the Temporal Swin Transformer for time-series modeling, and employ a stackable serial tri-modal cross-attention mechanism to optimize the fusion of heterogeneous modalities. Furthermore, empirical analysis reveals significant geographic distribution shifts between PA training and test samples, and models trained by directly mixing PO and PA data tend to experience performance degradation due to label noise in PO data. To address this, we draw on the mixture-of-experts paradigm: test samples are partitioned according to their spatial proximity to PA samples, and different models trained on distinct datasets are used for inference and post-processing within each partition. Experiments on the GeoLifeCLEF 2025 dataset demonstrate that our approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shifts.
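The partitioning step in this abstract (splitting test samples by spatial proximity to PA samples and sending each partition to a different model) can be sketched as a simple router. The 10 km radius, the expert labels `pa_expert` and `po_expert`, and the choice of which expert serves which partition are all assumptions; the abstract gives none of these specifics.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def route_test_sample(sample, pa_locations, near_km=10.0):
    """Pick an expert for a (lat, lon) test sample from its distance
    to the nearest PA survey site. Labels and radius are hypothetical."""
    nearest = min(haversine_km(sample[0], sample[1], lat, lon)
                  for lat, lon in pa_locations)
    return "pa_expert" if nearest <= near_km else "po_expert"
```

A test point about a kilometre from a PA site is routed to `pa_expert`, and one hundreds of kilometres away goes to `po_expert`.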

76. 【2602.08277】PISCO: Precise Video Instance Insertion with Sparse Control

Link: https://arxiv.org/abs/2602.08277

Authors: Xiangbo Gao, Renjie Li, Xinghao Chen, Yuheng Wu, Suofei Feng, Qing Yin, Zhengzhong Tu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: moving beyond general, high-fidelity post-processing, undergoing a pivotal, relies on exhaustive, exhaustive prompt-engineering

Abstract:The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation - which relies on exhaustive prompt-engineering and "cherry-picking" - towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task demands several requirements: precise spatial-temporal placement, physically consistent scene interaction, and the faithful preservation of original dynamics - all achieved under minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: this http URL.

77. 【2602.08266】Informative Object-centric Next Best View for Object-aware 3D Gaussian Splatting in Cluttered Scenes

Link: https://arxiv.org/abs/2602.08266

Authors: Seunghoon Jeong, Eunho Lee, Jeongyun Kim, Ayoung Kim

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: selecting informative viewpoints, selecting informative, inevitable occlusions, occlusions and incomplete, essential for building

Comments: 9 pages, 8 figures, 4 tables, accepted to ICRA 2026

Abstract:In cluttered scenes with inevitable occlusions and incomplete observations, selecting informative viewpoints is essential for building a reliable representation. In this context, 3D Gaussian Splatting (3DGS) offers a distinct advantage, as it can explicitly guide the selection of subsequent viewpoints and then refine the representation with new observations. However, existing approaches rely solely on geometric cues, neglect manipulation-relevant semantics, and tend to prioritize exploitation over exploration. To tackle these limitations, we introduce an instance-aware Next Best View (NBV) policy that prioritizes underexplored regions by leveraging object features. Specifically, our object-aware 3DGS distills instance-level information into one-hot object vectors, which are used to compute confidence-weighted information gain that guides the identification of regions associated with erroneous and uncertain Gaussians. Furthermore, our method can be easily adapted to an object-centric NBV, which focuses view selection on a target object, thereby improving reconstruction robustness to object placement. Experiments demonstrate that our NBV policy reduces depth error by up to 77.14% on the synthetic dataset and 34.10% on the real-world GraspNet dataset compared to baselines. Moreover, compared to targeting the entire scene, performing NBV on a specific object yields an additional reduction of 25.60% in depth error for that object. We further validate the effectiveness of our approach through real-world robotic manipulation tasks.

78. 【2602.08262】Moving Beyond Functional Connectivity: Time-Series Modeling for fMRI-Based Brain Disorder Classification

Link: https://arxiv.org/abs/2602.08262

Authors: Guoqi Yu, Xiaowei Hu, Angelica I. Aviles-Rivero, Anqi Qiu, Shujun Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, Functional magnetic resonance, enables non-invasive brain, non-invasive brain disorder, BOLD signals

Comments: This paper has been accepted by IEEE Transactions on Medical Imaging

Abstract:Functional magnetic resonance imaging (fMRI) enables non-invasive brain disorder classification by capturing blood-oxygen-level-dependent (BOLD) signals. However, most existing methods rely on functional connectivity (FC) via Pearson correlation, which reduces 4D BOLD signals to static 2D matrices, discarding temporal dynamics and capturing only linear inter-regional relationships. In this work, we benchmark state-of-the-art temporal models (e.g., time-series models such as PatchTST, TimesNet, and TimeMixer) on raw BOLD signals across five public datasets. Results show these models consistently outperform traditional FC-based approaches, highlighting the value of directly modeling temporal information such as cycle-like oscillatory fluctuations and drift-like slow baseline trends. Building on this insight, we propose DeCI, a simple yet effective framework that integrates two key principles: (i) Cycle and Drift Decomposition to disentangle cycle and drift within each ROI (Region of Interest); and (ii) Channel-Independence to model each ROI separately, improving robustness and reducing overfitting. Extensive experiments demonstrate that DeCI achieves superior classification accuracy and generalization compared to both FC-based and temporal baselines. Our findings advocate for a shift toward end-to-end temporal modeling in fMRI analysis to better capture complex brain dynamics. The code is available at this https URL.
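
To illustrate the abstract's point that Pearson-based functional connectivity discards temporal dynamics, here is a minimal sketch in pure Python. It is not the authors' code: the ROI count, the number of time points, and the random signals are illustrative assumptions. It builds an FC matrix from synthetic ROI time series and shows that shuffling the time points (identically for every ROI) leaves the matrix unchanged, so the ordering of the signal plays no role in the FC representation.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def functional_connectivity(bold):
    """Static ROI-by-ROI correlation matrix from a list of ROI time series."""
    n = len(bold)
    return [[pearson(bold[i], bold[j]) for j in range(n)] for i in range(n)]


# Synthetic stand-in for BOLD signals (assumed sizes, random values).
rng = random.Random(0)
n_rois, n_timepoints = 4, 50
bold = [[rng.gauss(0.0, 1.0) for _ in range(n_timepoints)] for _ in range(n_rois)]

fc = functional_connectivity(bold)

# Apply the same random permutation of time points to every ROI.
order = list(range(n_timepoints))
rng.shuffle(order)
shuffled = [[series[t] for t in order] for series in bold]
fc_shuffled = functional_connectivity(shuffled)

# Largest element-wise difference between the two FC matrices.
max_diff = max(
    abs(fc[i][j] - fc_shuffled[i][j])
    for i in range(n_rois)
    for j in range(n_rois)
)
print(max_diff)  # differs from zero only by floating-point rounding
```

Because the correlation sums run over all time points regardless of order, any model that consumes only this matrix cannot distinguish the original signal from its time-shuffled version, which is the motivation the abstract gives for modeling raw time series instead.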

79. 【2602.08241】Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

Link: https://arxiv.org/abs/2602.08241

Authors: Siqu Ou, Tianrui Wan, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing approaches largely, approaches largely rely, provide limited mechanisms, substantially improved multimodal, improved multimodal large

Abstract:While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.

80. 【2602.08236】When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Link: https://arxiv.org/abs/2602.08236

Authors: Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Multimodal Large, Large Language Models, correct answers depend, progress in Multimodal

Comments: The first two authors contributed equally. Project page: [this https URL](https://adaptive-visual-tts.github.io/)

Abstract:Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

81. 【2602.08230】Generating Adversarial Events: A Motion-Aware Point Cloud Framework

Link: https://arxiv.org/abs/2602.08230

Authors: Hongwei Ren, Youxin Jiang, Qifei Gu, Xiangqian Wu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous driving, human-computer interaction, widely adopted, adopted in safety-critical, safety-critical domains

Abstract:Event cameras have been widely adopted in safety-critical domains such as autonomous driving, robotics, and human-computer interaction. A pressing challenge arises from the vulnerability of deep neural networks to adversarial examples, which poses a significant threat to the reliability of event-based systems. Nevertheless, research into adversarial attacks on events is scarce. This is primarily due to the non-differentiable nature of mainstream event representations, which hinders the extension of gradient-based attack methods. In this paper, we propose MA-ADV, a novel Motion-Aware Adversarial framework. To the best of our knowledge, this is the first work to generate adversarial events by leveraging point cloud representations. MA-ADV accounts for high-frequency noise in events and employs a diffusion-based approach to smooth perturbations, while fully leveraging the spatial and temporal relationships among events. Finally, MA-ADV identifies the minimal-cost perturbation through a combination of sample-wise Adam optimization, iterative refinement, and binary search. Extensive experimental results validate that MA-ADV ensures a 100% attack success rate with minimal perturbation cost, and also demonstrate enhanced robustness against defenses, underscoring the critical security challenges facing future event-based perception systems.

82. 【2602.08224】Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

Link: https://arxiv.org/abs/2602.08224

Authors: Jing Zhang, Zhikai Li, Xuewen Liu, Qingyi Gu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time video processing, shows excellent performance, heavy computational burden, computational burden hinders, video object segmentation

Comments: ICLR 2026. Code is available at: [this https URL](https://github.com/jingjing0419/Efficient-SAM2)

Abstract:Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration of post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and accelerating inference: i) In the mask decoder, attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which prompts SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set.

83. 【2602.08211】Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension

Link: https://arxiv.org/abs/2602.08211

Authors: Yik Lung Pang, Changjae Oh

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: referring expression comprehension, expression comprehension, involves the localisation, referring expression, referred object

Comments: 4 pages, 5 figures, 2 tables

Abstract:Given a textual description, the task of referring expression comprehension (REC) involves the localisation of the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks through scaling up the model size and training data. Moreover, the performance of MLLMs can be further improved using techniques such as Chain-of-Thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse how various techniques for providing additional visual and textual context to the MLLM via tool use affect the REC task. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on RefCOCO/RefCOCOg/RefCOCO+ and Ref-L4 datasets and show that individual textual or visual context can improve the REC performance without any fine-tuning. By combining multiple contexts, our training-free framework shows between 5% and 30% performance gain over the baseline model on accuracy at various Intersection over Union (IoU) thresholds.
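
For readers unfamiliar with the metric named at the end of the abstract, the following is a minimal sketch of accuracy at an IoU threshold. It is not taken from the paper: the `(x1, y1, x2, y2)` box format, the function names, the example boxes, and the use of `>=` for a hit are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


# Hypothetical predictions: one exact match, one partial overlap, one miss.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
ground_truths = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]

for threshold in (0.3, 0.5):
    print(threshold, accuracy_at_iou(predictions, ground_truths, threshold))
```

In this toy example the partial overlap has an IoU of 1/3, so it counts as correct at a threshold of 0.3 but not at 0.5. That is why reported gains can vary with the threshold chosen.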

84. 【2602.08206】Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Link: https://arxiv.org/abs/2602.08206

Authors: Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: promising research direction, pre-defined category sets, remote sensing, enabling the recognition, promising research

Comments: 5 pages, 3 figures

Abstract:Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.

85. 【2602.08202】Generative Regression for Left Ventricular Ejection Fraction Estimation from Echocardiography Video

Link: https://arxiv.org/abs/2602.08202

Authors: Jinrong Lv, Xun Gong, Zhaohuan Li, Weili Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating Left Ventricular, Ventricular Ejection Fraction, Left Ventricular Ejection, Estimating Left, Ejection Fraction

Comments: 11 pages, 5 tables, 10 figures. Under peer review

Abstract:Estimating Left Ventricular Ejection Fraction (LVEF) from echocardiograms constitutes an ill-posed inverse problem. Inherent noise, artifacts, and limited viewing angles introduce ambiguity, where a single video sequence may map not to a unique ground truth, but rather to a distribution of plausible physiological values. Prevailing deep learning approaches typically formulate this task as a standard regression problem that minimizes the Mean Squared Error (MSE). However, this paradigm compels the model to learn the conditional expectation, which may yield misleading predictions when the underlying posterior distribution is multimodal or heavy-tailed -- a common phenomenon in pathological scenarios. In this paper, we investigate the paradigm shift from deterministic regression toward generative regression. We propose the Multimodal Conditional Score-based Diffusion model for Regression (MCSDR), a probabilistic framework designed to model the continuous posterior distribution of LVEF conditioned on echocardiogram videos and patient demographic attribute priors. Extensive experiments conducted on the EchoNet-Dynamic, EchoNet-Pediatric, and CAMUS datasets demonstrate that MCSDR achieves state-of-the-art performance. Notably, qualitative analysis reveals that the generation trajectories of our model exhibit distinct behaviors in cases characterized by high noise or significant physiological variability, thereby offering a novel layer of interpretability for AI-aided diagnosis.
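
The abstract's argument that MSE regression learns the conditional expectation, and that this can mislead when the posterior is multimodal, can be checked with a small numeric sketch. The target values below (two equally likely modes at 30 and 60) are made up for illustration and are not data from the paper.

```python
def mse(prediction, targets):
    """Mean squared error of a single constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Hypothetical bimodal set of plausible values for one ambiguous input:
# half of the plausible answers sit at 30, the other half at 60.
targets = [30.0] * 50 + [60.0] * 50

# Search a grid of candidate predictions from 20.0 to 70.0 in steps of 0.1.
candidates = [c / 10 for c in range(200, 701)]
best = min(candidates, key=lambda c: mse(c, targets))

print(best)                # 45.0, the mean of the two modes
print(mse(best, targets))  # 225.0
print(mse(30.0, targets))  # 450.0, predicting either mode scores worse
```

The MSE-optimal prediction lands at 45, a value that none of the plausible answers takes. A model that instead represents the full distribution, which is what the abstract calls generative regression, could report both modes rather than their average.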

86. 【2602.08198】PEGAsus: 3D Personalization of Geometry and Appearance

Link: https://arxiv.org/abs/2602.08198

Authors: Jingyu Hu, Bin Hu, Ka-Hei Hui, Haipeng Li, Zhengzhe Liu, Daniel Cohen-Or, Chi-Wing Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Geometry and Appearance, Appearance levels, framework capable, capable of generating, Appearance

Abstract:We present PEGAsus, a new framework capable of generating Personalized 3D shapes by learning shape concepts at both Geometry and Appearance levels. First, we formulate 3D shape personalization as extracting reusable, category-agnostic geometric and appearance attributes from reference shapes, and composing these attributes with text to generate novel shapes. Second, we design a progressive optimization strategy to learn shape concepts at both the geometry and appearance levels, decoupling the shape concept learning process. Third, we extend our approach to region-wise concept learning, enabling flexible concept extraction, with context-aware and context-free losses. Extensive experimental results show that PEGAsus is able to effectively extract attributes from a wide range of reference shapes and then flexibly compose these concepts with text to synthesize new shapes. This enables fine-grained control over shape generation and supports the creation of diverse, personalized results, even in challenging cross-category scenarios. Both quantitative and qualitative experiments demonstrate that our approach outperforms existing state-of-the-art solutions.

87. 【2602.08189】Chamelion: Reliable Change Detection for Long-Term LiDAR Mapping in Transient Environments

Link: https://arxiv.org/abs/2602.08189

Authors: Seoyeon Jang, Alex Junho Lee, I Made Aswin Nahrendra, Hyun Myung

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Online change detection, crucial for mobile, mobile robots, robots to efficiently, efficiently navigate

Comments: 8 pages, IEEE Robot. Automat. Lett. (RA-L) 2026

Abstract:Online change detection is crucial for mobile robots to efficiently navigate through dynamic environments. Detecting changes in transient settings, such as active construction sites or frequently reconfigured indoor spaces, is particularly challenging due to frequent occlusions and spatiotemporal variations. Existing approaches often struggle to detect changes and fail to update the map across different observations. To address these limitations, we propose a dual-head network designed for online change detection and long-term map maintenance. A key difficulty in this task is the collection and alignment of real-world data, as manually registering structural differences over time is both labor-intensive and often impractical. To overcome this, we develop a data augmentation strategy that synthesizes structural changes by importing elements from different scenes, enabling effective model training without the need for extensive ground-truth annotations. Experiments conducted at real-world construction sites and in indoor office environments demonstrate that our approach generalizes well across diverse scenarios, achieving efficient and accurate map updates. Our source code and additional material are available at: this https URL.

88. 【2602.08168】DAS-SK: An Adaptive Model Integrating Dual Atrous Separable and Selective Kernel CNN for Agriculture Semantic Segmentation

Link: https://arxiv.org/abs/2602.08168

Authors: Mei Ling Chee, Thangarajah Akilan, Aparna Ravindra Phalke, Kanchan Keisham

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Semantic segmentation, imagery demands models, agricultural imagery demands, practical systems, imagery demands

Comments: 13 pages

Abstract:Semantic segmentation in high-resolution agricultural imagery demands models that strike a careful balance between accuracy and computational efficiency to enable deployment in practical systems. In this work, we propose DAS-SK, a novel lightweight architecture that retrofits selective kernel convolution (SK-Conv) into the dual atrous separable convolution (DAS-Conv) module to strengthen multi-scale feature learning. The model further enhances the atrous spatial pyramid pooling (ASPP) module, enabling the capture of fine-grained local structures alongside global contextual information. Built upon a modified DeepLabV3 framework with two complementary backbones - MobileNetV3-Large and EfficientNet-B3, the DAS-SK model mitigates limitations associated with large dataset requirements, limited spectral generalization, and the high computational cost that typically restricts deployment on UAVs and other edge devices. Comprehensive experiments across three benchmarks: this http URL, VDD, and PhenoBench, demonstrate that DAS-SK consistently achieves state-of-the-art performance, while being more efficient than CNN-, transformer-, and hybrid-based competitors. Notably, DAS-SK requires up to 21x fewer parameters and 19x fewer GFLOPs than top-performing transformer models. These findings establish DAS-SK as a robust, efficient, and scalable solution for real-time agricultural robotics and high-resolution remote sensing, with strong potential for broader deployment in other vision domains.

89. 【2602.08167】Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

Link: https://arxiv.org/abs/2602.08167

Authors: Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: current methods rely, high-level plans, structural affordances, significantly enhanced, current methods

Abstract:Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce RB-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate RB-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. RB-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

90. 【2602.08145】Reliable and Responsible Foundation Models: A Comprehensive Survey

Link: https://arxiv.org/abs/2602.08145

Authors: Xinyu Yang, Junlin Han, Rishi Bommasani, Jinqi Luo, Wenjie Qu, Wangchunshu Zhou, Adel Bibi, Xiyao Wang, Jaehong Yoon, Elias Stengel-Eskin, Shengbang Tong, Lingfeng Shen, Rafael Rafailov, Runjia Li, Zhaoyang Wang, Yiyang Zhou, Chenhang Cui, Yu Wang, Wenhao Zheng, Huichi Zhou, Jindong Gu, Zhaorun Chen, Peng Xia, Tony Lee, Thomas Zollo, Vikash Sehwag, Jixuan Leng, Jiuhai Chen, Yuxin Wen, Huan Zhang, Zhun Deng, Linjun Zhang, Pavel Izmailov, Pang Wei Koh, Yulia Tsvetkov, Andrew Wilson, Jiaheng Zhang, James Zou, Cihang Xie, Hao Wang, Philip Torr, Julian McAuley, David Alvarez-Melis, Florian Tramèr, Kaidi Xu, Suman Jana, Chris Callison-Burch, Rene Vidal, Filippos Kokkinos, Mohit Bansal, Beidi Chen, Huaxiu Yao

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: Multimodal Large Language, Large Language Models, Image Generative Models, Video Generative Models, including Large Language

Comments: TMLR camera-ready version

Abstract:Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, science, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.

91. 【2602.08136】Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Link: https://arxiv.org/abs/2602.08136

Authors: Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu, Shagufta Mehnaz

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: core part, Vision-Language Models, safety alignment, VLMs, attacks

Comments: 22 Pages, long conference paper

Abstract:Vision-Language Models (VLMs) are now a core part of modern AI. Recent work proposed several visual jailbreak attacks using single/holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after combining images. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to architectural and prior mismatches across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in the current VLM safety alignment.

92. 【2602.08131】Fields of The World: A Field Guide for Extracting Agricultural Field Boundaries

Link: https://arxiv.org/abs/2602.08131

Authors: Isaac Corley, Hannah Kerner, Caleb Robinson, Jennifer Marcus

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: support crop monitoring, agricultural data products, yield estimation, disease estimation, building block

Abstract:Field boundary maps are a building block for agricultural data products and support crop monitoring, yield estimation, and disease estimation. This tutorial presents the Fields of The World (FTW) ecosystem: a benchmark of 1.6M field polygons across 24 countries, pre-trained segmentation models, and command-line inference tools. We provide two notebooks that cover (1) local-scale field boundary extraction with crop classification and forest loss attribution, and (2) country-scale inference using cloud-optimized data. We use MOSAIKS random convolutional features and FTW-derived field boundaries to map crop type at the field level and report macro F1 scores of 0.65–0.75 for crop type classification with limited labels. Finally, we show how to explore pre-computed predictions over five countries (4.76M km²), with median predicted field areas from 0.06 ha (Rwanda) to 0.28 ha (Switzerland).

93. 【2602.08126】MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection

链接https://arxiv.org/abs/2602.08126

作者:Venkatraman Narayanan,Bala Sai,Rahul Ahuja,Pratik Likhar,Varun Ravi Kumar,Senthil Yogamani

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal fusion algorithms, persistent challenge, remain a persistent, LiDAR remain, Reliable

备注

点击查看摘要

Abstract:Reliable 3D object detection is fundamental to autonomous driving, and multimodal fusion algorithms using cameras and LiDAR remain a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they have difficulties including inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty. We introduce MambaFusion, a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate the global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera-LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.

94. 【2602.08117】Building Damage Detection using Satellite Images and Patch-Based Transformer Methods

链接https://arxiv.org/abs/2602.08117

作者:Smriti Siva,Jan Cross-Zamirski

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Rapid building damage, Rapid building, building damage assessment, post-disaster response, assessment is critical

备注: 8 pages, 5 figures

点击查看摘要

Abstract:Rapid building damage assessment is critical for post-disaster response. Damage classification models built on satellite imagery provide a scalable means of obtaining situational awareness. However, label noise and severe class imbalance in satellite data create major challenges. The xBD dataset offers a standardized benchmark for building-level damage across diverse geographic regions. In this study, we evaluate Vision Transformer (ViT) model performance on the xBD dataset, investigating how these models distinguish between types of structural damage when training on noisy, imbalanced data. Specifically, we evaluate DINOv2-small and DeiT for multi-class damage classification. We propose a targeted patch-based pre-processing pipeline to isolate structural features and minimize background noise in training. We adopt a frozen-head fine-tuning strategy to keep computational requirements manageable. Model performance is evaluated through accuracy, precision, recall, and macro-averaged F1 scores. We show that small ViT architectures with our novel training method achieve competitive macro-averaged F1 relative to prior CNN baselines for disaster classification.
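
The abstract above reports macro-averaged F1 because of severe class imbalance. As a minimal sketch of why that metric matters, the snippet below computes macro F1 from scratch and compares it with accuracy. The labels are hypothetical toy data, not drawn from xBD, and `macro_f1` is an illustrative helper rather than the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels: class 0 dominates, class 1 is rare.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  macro_f1={score:.3f}")
```

In this toy case the majority-class predictor looks strong on accuracy but scores far lower on macro F1, because the rare class contributes an F1 of zero to the average.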

95. 【2602.08112】MMLSv2: A Multimodal Dataset for Martian Landslide Detection in Remote Sensing Imagery

链接https://arxiv.org/abs/2602.08112

作者:Sidike Paheding,Abel Reyes-Angulo,Leo Thomas Ramos,Angel D. Sappa,Rajaneesh A.,Hiral P. B.,Sajin Kumar K. S.,Thomas Oommen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Martian surfaces, Martian, segmentation on Martian, isolated test set, Abstract

备注

点击查看摘要

Abstract:We present MMLSv2, a dataset for landslide segmentation on Martian surfaces. MMLSv2 consists of multimodal imagery with seven bands: RGB, digital elevation model, slope, thermal inertia, and grayscale channels. MMLSv2 comprises 664 images distributed across training, validation, and test splits. In addition, an isolated test set of 276 images from a geographically disjoint region from the base dataset is released to evaluate spatial generalization. Experiments conducted with multiple segmentation models show that the dataset supports stable training and achieves competitive performance, while still posing challenges in fragmented, elongated, and small-scale landslide regions. Evaluation on the isolated test set leads to a noticeable performance drop, indicating increased difficulty and highlighting its value for assessing model robustness and generalization beyond standard in-distribution settings. Dataset will be available at: this https URL

96. 【2602.08099】VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

链接https://arxiv.org/abs/2602.08099

作者:Issar Tzachor,Dvir Samuel,Rami Ben-Ari

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Large Language Models, generative Multimodal Large, adapted generative Multimodal, Multimodal Large

备注: Project page: [this https URL](https://iyttor.github.io/VidVec/)

点击查看摘要

Abstract:Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.

97. 【2602.08071】ViT-5: Vision Transformers for The Mid-2020s

链接https://arxiv.org/abs/2602.08071

作者:Feng Wang,Sucheng Ren,Tiezheng Zhang,Predrag Neskovic,Anand Bhattad,Cihang Xie,Alan Yuille

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:leveraging architectural advancements, modernizing Vision Transformer, past five years, Vision Transformers, work presents

备注: Code is available at [this https URL](https://github.com/wangf3014/ViT-5)

点击查看摘要

Abstract:This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.

98. 【2602.08068】ReRoPE: Repurposing RoPE for Relative Camera Control

链接https://arxiv.org/abs/2602.08068

作者:Chunyang Li,Yuanbo Yang,Jiahao Shao,Hongyu Zhou,Katja Schwarz,Yiyi Liao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:interactive content creation, content creation, viewpoints is essential, essential for applications, interactive content

备注

点击查看摘要

Abstract:Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: this https URL
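
For background on the RoPE mechanism the abstract builds on, here is a minimal sketch of standard rotary embeddings in plain Python. It shows two well-known properties: the attention score depends only on the relative offset between positions, and the lowest-frequency band rotates very slowly. This is generic RoPE only; how ReRoPE injects camera poses into those bands is not specified in the abstract and is not attempted here. The vectors and the `base=10000.0` default are illustrative assumptions.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """One rotation frequency per 2-D pair: base^(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive (x, y) pair of vec by position * frequency."""
    out = []
    for i, freq in enumerate(rope_frequencies(len(vec), base)):
        angle = position * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 8-dimensional query and key vectors.
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [-0.8, 0.2, 0.6, -1.0, 0.5, 0.3, -0.7, 0.4]

# Same offset (3) at two different absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))
score_far = dot(apply_rope(q, 105), apply_rope(k, 102))

freqs = rope_frequencies(len(q))
print(score_near, score_far)  # equal up to floating-point error
print(freqs[0], freqs[-1])    # fastest and slowest rotation rates
```

The slowest band advances by only about 0.001 radians per position in this 8-dimensional example, so over short sequences it carries little positional signal.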

99. 【2602.08059】DICE: Disentangling Artist Style from Content via Contrastive Subspace Decomposition in Diffusion Models

链接https://arxiv.org/abs/2602.08059

作者:Tong Zhang,Ru Zhang,Jianyi Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:imitate unique artistic, unique artistic styles, style, enabling users, recent proliferation

备注

点击查看摘要

Abstract:The recent proliferation of diffusion models has made style mimicry effortless, enabling users to imitate unique artistic styles without authorization. In deployed platforms, this raises copyright and intellectual-property risks and calls for reliable protection. However, existing countermeasures either require costly weight editing as new styles emerge or rely on an explicitly specified editing style, limiting their practicality for deployment-side safety. To address this challenge, we propose DICE (Disentanglement of artist Style from Content via Contrastive Subspace Decomposition), a training-free framework for on-the-fly artist style erasure. Unlike style editing, which requires an explicitly specified replacement style, DICE performs style purification, removing the artist's characteristics while preserving the user-intended content. Our core insight is that a model cannot truly comprehend the artist style from a single text or image alone. Consequently, we abandon the traditional paradigm of identifying style from isolated samples. Instead, we construct contrastive triplets to compel the model to distinguish between style and non-style features in the latent space. By formalizing this disentanglement process as a solvable generalized eigenvalue problem, we achieve precise identification of the style subspace. Furthermore, we introduce an Adaptive Attention Decoupling Editing strategy that dynamically assesses the style concentration of each token and performs differential suppression and content enhancement on the QKV vectors. Extensive experiments demonstrate that DICE achieves a superior balance between the thoroughness of style erasure and the preservation of content integrity. DICE introduces an additional overhead of only 3 seconds to disentangle style, providing a practical and efficient technique for curbing style mimicry.

100. 【2602.08058】Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

链接https://arxiv.org/abs/2602.08058

作者:Xihang Yu,Rajat Talak,Lorenzo Shaikewitz,Luca Carlone

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

关键词:geometrically accurate scene, measurement noise, geometrically accurate, sensor data, accurate scene reconstructions

备注: 15 pages

点击查看摘要

Abstract:In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
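
To illustrate the general idea of rejection sampling under a non-penetration constraint, here is a toy sketch with spheres. It is not Picasso's sampler: it uses no contact graph and no stability check, and every name and number in it (`sample_scene`, the radii, `sigma`) is a hypothetical choice made for the example. Noisy per-object estimates that overlap are perturbed jointly, and a draw is kept only if no pair of objects interpenetrates.

```python
import math
import random


def penetration_depth(c1, r1, c2, r2):
    """Overlap between two spheres; zero when they are separate or touching."""
    return max(0.0, r1 + r2 - math.dist(c1, c2))


def sample_scene(estimates, radii, sigma, rng, max_attempts=1000):
    """Perturb all centers jointly; return the first draw with no overlaps."""
    n = len(estimates)
    for _ in range(max_attempts):
        centers = [
            tuple(x + rng.gauss(0.0, sigma) for x in center)
            for center in estimates
        ]
        feasible = all(
            penetration_depth(centers[i], radii[i], centers[j], radii[j]) == 0.0
            for i in range(n)
            for j in range(i + 1, n)
        )
        if feasible:
            return centers
    return None  # no feasible sample found within the attempt limit


# Two unit-diameter spheres whose independent estimates overlap by about 0.2.
estimates = [(0.0, 0.0, 0.0), (0.8, 0.0, 0.0)]
radii = [0.5, 0.5]
initial_overlap = penetration_depth(estimates[0], radii[0], estimates[1], radii[1])

rng = random.Random(0)  # fixed seed so the run is repeatable
scene = sample_scene(estimates, radii, sigma=0.2, rng=rng)
print("initial overlap:", initial_overlap)
print("accepted centers:", scene)
```

The point of the sketch is the holistic check: feasibility is evaluated over the whole scene at once, so a sample is rejected even when each object individually fits its own estimate.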

101. 【2602.08057】Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks

链接https://arxiv.org/abs/2602.08057

作者:Yufei Wang,Haixu Liu,Tianxiang Xu,Chuancheng Shi,Hongsheng Xing

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:iMiGUE tennis-interview dataset, multimodal weak-supervision framework, concealed emotions, framework and achieves, tennis-interview dataset

备注

点击查看摘要

Abstract:To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match - or even surpass - GCN-based counterparts in this task.

102. 【2602.08047】Vanilla Group Equivariant Vision Transformer: Simple and Effective

链接https://arxiv.org/abs/2602.08047

作者:Jiahong Fu,Qi Xie,Deyu Meng,Zongben Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Incorporating symmetry priors, Incorporating symmetry, Patch Embedding, including patch embedding, symmetry priors

备注

点击查看摘要

Abstract:Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.

103. 【2602.08046】Enhanced Mixture 3D CGAN for Completion and Generation of 3D Objects

链接https://arxiv.org/abs/2602.08046

作者:Yahia Hamdi,Nicolas Andrialovanirina,Kélig Mahé,Emilie Poisson Caillault

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generative Adversarial Networks, computer vision, generation and completion, represent a transformative, Adversarial Networks

备注: 11

点击查看摘要

Abstract:The generation and completion of 3D objects represent a transformative challenge in computer vision. Generative Adversarial Networks (GANs) have recently demonstrated strong potential in synthesizing realistic visual data. However, they often struggle to capture complex and diverse data distributions, particularly in scenarios involving incomplete inputs or significant missing regions. These challenges arise mainly from the high computational requirements and the difficulty of modeling heterogeneous and structurally intricate data, which restrict their applicability in real-world settings. Mixture of Experts (MoE) models have emerged as a promising solution to these limitations. By dynamically selecting and activating the most relevant expert sub-networks for a given input, MoEs improve both performance and efficiency. In this paper, we investigate the integration of Deep 3D Convolutional GANs (CGANs) with a MoE framework to generate high-quality 3D models and reconstruct incomplete or damaged objects. The proposed architecture incorporates multiple generators, each specialized to capture distinct modalities within the dataset. Furthermore, an auxiliary loss-free dynamic capacity constraint (DCC) mechanism is introduced to guide the selection of categorical generators, ensuring a balance between specialization, training stability, and computational efficiency, which is critical for 3D voxel processing. We evaluated the model's ability to generate and complete shapes with missing regions of varying sizes and compared its performance with state-of-the-art approaches. Both quantitative and qualitative results confirm the effectiveness of the proposed MoE-DCGAN in handling complex 3D data.

104. 【2602.08025】MIND: Benchmarking Memory Consistency and Action Control in World Models

链接https://arxiv.org/abs/2602.08025

作者:Yixuan Ye,Xuanyu Lu,Yuxin Jiang,Yuchao Gu,Rui Zhao,Qiwei Liang,Jiachun Pan,Fengda Zhang,Weijia Wu,Alex Jinpeng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:dynamic visual environments, predict dynamic visual, World models aim, abilities remains lacking, fundamental abilities remains

备注

点击查看摘要

Abstract:World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Project page: this https URL

105. 【2602.08024】FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging

链接https://arxiv.org/abs/2602.08024

作者:Ziyang Fan,Keyu Chen,Ruilong Xing,Yulin Li,Li Jiang,Zhuotao Tian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Video Large Language, Language Models, Large Language, shown remarkable capabilities

备注: Accepted by ICLR 2026 (Oral)

点击查看摘要

Abstract:Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLM acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at this https URL.
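
The link between retaining 10% of visual tokens and a 10x increase in frames can be checked with simple arithmetic. The sketch below assumes the budget is a fixed count of visual tokens reaching the language model, which ignores the cost of the vision encoder and of the selection step itself. The per-frame token count and the baseline frame count are hypothetical values, not figures from the paper.

```python
def frames_within_budget(token_budget, tokens_per_frame, keep_percent):
    """How many frames fit if each frame keeps keep_percent of its tokens.

    Integer arithmetic is used throughout to avoid floating-point rounding.
    """
    kept_per_frame_x100 = tokens_per_frame * keep_percent
    return (token_budget * 100) // kept_per_frame_x100


# Hypothetical setup: 196 visual tokens per frame, 32 frames at full retention.
tokens_per_frame = 196
baseline_frames = 32
budget = baseline_frames * tokens_per_frame

frames_full = frames_within_budget(budget, tokens_per_frame, keep_percent=100)
frames_pruned = frames_within_budget(budget, tokens_per_frame, keep_percent=10)
print(frames_full, frames_pruned, frames_pruned // frames_full)
```

Under that assumption, the same token budget covers ten times as many frames once each frame keeps only a tenth of its tokens.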

106. 【2602.08020】PhysDrape: Learning Explicit Forces and Collision Constraints for Physically Realistic Garment Draping

链接https://arxiv.org/abs/2602.08020

作者:Minghai Chen,Mingyuan Liu,Yuxiang Huan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:traditional Physics-Based Simulation, Deep learning-based garment, Deep learning-based, Physics-Based Simulation, robust collision handling

备注

点击查看摘要

Abstract:Deep learning-based garment draping has emerged as a promising alternative to traditional Physics-Based Simulation (PBS), yet robust collision handling remains a critical bottleneck. Most existing methods enforce physical validity through soft penalties, creating an intrinsic trade-off between geometric feasibility and physical plausibility: penalizing collisions often distorts mesh structure, while preserving shape leads to interpenetration. To resolve this conflict, we present PhysDrape, a hybrid neural-physical solver for physically realistic garment draping driven by explicit forces and constraints. Unlike soft-constrained frameworks, PhysDrape integrates neural inference with explicit geometric solvers in a fully differentiable pipeline. Specifically, we propose a Physics-Informed Graph Neural Network conditioned on a physics-enriched graph -- encoding material parameters and body proximity -- to predict residual displacements. Crucially, we integrate a differentiable two-stage solver: first, a learnable Force Solver iteratively resolves unbalanced forces derived from the Saint Venant-Kirchhoff (StVK) model to ensure quasi-static equilibrium; second, a Differentiable Projection strictly enforces collision constraints against the body surface. This differentiable design guarantees physical validity through explicit constraints, while enabling end-to-end learning to optimize the network for physically consistent predictions. Extensive experiments demonstrate that PhysDrape achieves state-of-the-art performance, ensuring negligible interpenetration with significantly lower strain energy compared to existing baselines, achieving superior physical fidelity and robustness in real-time.

107. 【2602.08006】ForecastOcc: Vision-based Semantic Occupancy Forecasting

链接https://arxiv.org/abs/2602.08006

作者:Riya Mohan,Juana Valeria Hurtado,Rohit Mohan,Abhinav Valada

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:time to effectively, effectively reason, occupancy, forecasting, semantic occupancy

备注

点击查看摘要

Abstract:Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.

108. 【2602.07993】MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance

链接https://arxiv.org/abs/2602.07993

作者:Xuehai Bai,Xiaoling Gu,Akide Liu,Hangjie Yuan,YiFan Zhang,Jack Ma

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:shown remarkable progress, Recent advances, instruction-based image editing, remarkable progress, image editing

备注: Accepted by AAAI2026

点击查看摘要

Abstract:Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the denoising process, while the latter preserves features in unedited regions to maintain background consistency. To enable effective training, we construct a dedicated data pipeline to mitigate the scarcity of complex instruction-based image editing datasets, combining fine-grained automatic filtering via a powerful MLLM with rigorous human validation. Finally, to comprehensively evaluate complex instruction-based image editing, we introduce CIE-Bench, a new benchmark with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments, achieving a 23.96% improvement in instruction compliance.

109. 【2602.07986】Deepfake Synthesis vs. Detection: An Uneven Contest

链接https://arxiv.org/abs/2602.07986

作者:Md. Tarek Hasan,Sanjay Saha,Shaojing Fan,Swakkhar Shatabda,Terence Sim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural Radiance Fields, Generative Adversarial Networks, traditional Generative Adversarial, synthetic media, rapid advancement

备注

点击查看摘要

Abstract:The rapid advancement of deepfake technology has significantly elevated the realism and accessibility of synthetic media. Emerging techniques, such as diffusion-based models and Neural Radiance Fields (NeRF), alongside enhancements in traditional Generative Adversarial Networks (GANs), have contributed to the sophisticated generation of deepfake videos. Concurrently, deepfake detection methods have seen notable progress, driven by innovations in Transformer architectures, contrastive learning, and other machine learning approaches. In this study, we conduct a comprehensive empirical analysis of state-of-the-art deepfake detection techniques, including human evaluation experiments against cutting-edge synthesis methods. Our findings highlight a concerning trend: many state-of-the-art detection models exhibit markedly poor performance when challenged with deepfakes produced by modern synthesis techniques, including poor performance by human participants against the best quality deepfakes. Through extensive experimentation, we provide evidence that underscores the urgent need for continued refinement of detection models to keep pace with the evolving capabilities of deepfake generation technologies. This research emphasizes the critical gap between current detection methodologies and the sophistication of new generation techniques, calling for intensified efforts in this crucial area of study.

110. 【2602.07980】Continuity-driven Synergistic Diffusion with Neural Priors for Ultra-Sparse-View CBCT Reconstruction

链接https://arxiv.org/abs/2602.07980

作者:Junlin Wang,Jiancheng Fang,Peng Peng,Shaoyu Wang,Qiegen Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:cone-beam computed tomography, computed tomography, clinical application, application of cone-beam, cone-beam computed

备注

点击查看摘要

Abstract:The clinical application of cone-beam computed tomography (CBCT) is constrained by the inherent trade-off between radiation exposure and image quality. Ultra-sparse angular sampling, employed to reduce dose, introduces severe undersampling artifacts and inter-slice inconsistencies, compromising diagnostic reliability. Existing reconstruction methods often struggle to balance angular continuity with spatial detail fidelity. To address these challenges, we propose a Continuity-driven Synergistic Diffusion with Neural priors (CSDN) for ultra-sparse-view CBCT reconstruction. Neural priors are introduced as a structural foundation to encode a continuous three-dimensional attenuation representation, enabling the synthesis of physically consistent dense projections from ultra-sparse measurements. Building upon this neural-prior-based initialization, a synergistic diffusion strategy is developed, consisting of two collaborative refinement paths: a Sinogram Refinement Diffusion (Sino-RD) process that restores angular continuity and a Digital Radiography Refinement Diffusion (DR-RD) process that enforces inter-slice consistency from the projection image perspective. The outputs of the two diffusion paths are adaptively fused by the Dual-Projection Reconstruction Fusion (DPRF) module to achieve coherent volumetric reconstruction. Extensive experiments demonstrate that the proposed CSDN effectively suppresses artifacts and recovers fine textures under ultra-sparse-view conditions, outperforming existing state-of-the-art techniques.

111. 【2602.07979】FSP-Diff: Full-Spectrum Prior-Enhanced Dual-Domain Latent Diffusion for Ultra-Low-Dose Spectral CT Reconstruction

链接https://arxiv.org/abs/2602.07979

作者:Peng Peng,Xinrui Zhang,Junlin Wang,Lei Li,Shaoyu Wang,Qiegen Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:photon-counting detectors holds, detectors holds immense, Spectral computed tomography, holds immense potential, computed tomography

备注

点击查看摘要

Abstract:Spectral computed tomography (CT) with photon-counting detectors holds immense potential for material discrimination and tissue characterization. However, under ultra-low-dose conditions, the sharply degraded signal-to-noise ratio (SNR) in energy-specific projections poses a significant challenge, leading to severe artifacts and loss of structural details in reconstructed images. To address this, we propose FSP-Diff, a full-spectrum prior-enhanced dual-domain latent diffusion framework for ultra-low-dose spectral CT reconstruction. Our framework integrates three core strategies: 1) Complementary Feature Construction: We integrate direct image reconstructions with projection-domain denoised results. While the former preserves latent textural nuances amidst heavy noise, the latter provides a stable structural scaffold to balance detail fidelity and noise suppression. 2) Full-Spectrum Prior Integration: By fusing multi-energy projections into a high-SNR full-spectrum image, we establish a unified structural reference that guides the reconstruction across all energy bins. 3) Efficient Latent Diffusion Synthesis: To alleviate the high computational burden of high-dimensional spectral data, multi-path features are embedded into a compact latent space. This allows the diffusion process to facilitate interactive feature fusion in a lower-dimensional manifold, achieving accelerated reconstruction while maintaining fine-grained detail restoration. Extensive experiments on simulated and real-world datasets demonstrate that FSP-Diff significantly outperforms state-of-the-art methods in both image quality and computational efficiency, underscoring its potential for clinically viable ultra-low-dose spectral CT imaging.

112. 【2602.07967】EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation

链接https://arxiv.org/abs/2602.07967

作者:Xiaofeng Tan,Wanjiang Weng,Haodong Lei,Hongsong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:undergone significant advancement, significant advancement, downstream objectives, undergone significant, pose challenges

备注

点击查看摘要

Abstract:In recent years, motion generative models have undergone significant advancement, yet they remain challenging to align with downstream objectives. Recent studies have shown that using differentiable rewards to directly align the preference of diffusion models yields promising results. However, these methods suffer from (1) inefficient and coarse-grained optimization and (2) high memory consumption. In this work, we first theoretically and empirically identify the key reason for these limitations: the recursive dependence between different steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes diffusion at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing optimization that is (1) dense and fine-grained and (2) memory-efficient. Furthermore, the scarcity of preference motion pairs restricts the training of motion reward models. To this end, we further introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and conducts preference learning. Extensive experiments demonstrate that EasyTune outperforms DRaFT-50 by 8.2% in alignment (MM-Dist) improvement while requiring only 31.16% of its additional memory overhead and achieving a 7.3x training speedup. The project page is available at this https URL.

113. 【2602.07960】D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

链接https://arxiv.org/abs/2602.07960

作者:Changli Tang,Tianyi Wang,Fengyun Rao,Jing Lyu,Chao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Spoken dialogue, accurately identifying, primary source

备注

点击查看摘要

Abstract:Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at this https URL. Our code, data, and checkpoints will be available at this https URL.

114. 【2602.07955】One-Shot Crowd Counting With Density Guidance For Scene Adaptation

链接https://arxiv.org/abs/2602.07955

作者:Jiwei Chen,Qi Wang,Junyu Gao,Jing Zhang,Dingyi Li,Jing-Jia Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unseen surveillance scenes, unseen surveillance scene, locations vary greatly, unseen surveillance, existing crowd models

备注

点击查看摘要

Abstract:Crowd scenes captured by cameras at different locations vary greatly, and existing crowd models have limited generalization for unseen surveillance scenes. To improve the generalization of the model, we regard different surveillance scenes as different category scenes, and introduce few-shot learning to make the model adapt to the unseen surveillance scene that belongs to the given exemplar category scene. To this end, we propose to leverage local and global density characteristics to guide the model of crowd counting for unseen surveillance scenes. Specifically, to enable the model to adapt to the density variations in the target scene, we propose a multiple local density learner that learns multiple prototypes representing different density distributions in the support scene. The resulting local density similarity matrices are then encoded and used to guide the model locally. To further adapt to the global density in the target scene, global density features are extracted from the support image and used to guide the model globally. Experiments on three surveillance datasets show that the proposed method adapts to unseen surveillance scenes and outperforms recent state-of-the-art methods in few-shot crowd counting.

115. 【2602.07938】Integrating Specialized and Generic Agent Motion Prediction with Dynamic Occupancy Grid Maps

链接https://arxiv.org/abs/2602.07938

作者:Rabbia Asghar,Lukas Rummelhard,Wenqian Liu,Anne Spalanzani,Christian Laugier

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:challenging task due, occupancy grid maps, multiple feasible futures, Accurate prediction, sensor data

备注: Updated version with major revisions; currently under the second round of review at IEEE Transactions on Intelligent Vehicles

点击查看摘要

Abstract:Accurate prediction of driving scenes is a challenging task due to uncertainty in sensor data, the complex behaviors of agents, and the possibility of multiple feasible futures. Existing prediction methods using occupancy grid maps primarily focus on agent-agnostic scene predictions, while agent-specific predictions provide specialized behavior insights with the help of semantic information. However, both paradigms face distinct limitations: agent-agnostic models struggle to capture the behavioral complexities of dynamic actors, whereas agent-specific approaches fail to generalize to poorly perceived or unrecognized agents; combining both enables robust and safer motion forecasting. To address this, we propose a unified framework by leveraging Dynamic Occupancy Grid Maps within a streamlined temporal decoding pipeline to simultaneously predict future occupancy state grids, vehicle grids, and scene flow grids. Relying on a lightweight spatiotemporal backbone, our approach is centered on a tailored, interdependent loss function that captures inter-grid dependencies and enables diverse future predictions. By using occupancy state information to enforce flow-guided transitions, the loss function acts as a regularizer that directs occupancy evolution while accounting for obstacles and occlusions. Consequently, the model not only predicts the specific behaviors of vehicle agents, but also identifies other dynamic entities and anticipates their evolution within the complex scene. Evaluations on the real-world nuScenes and Woven Planet datasets demonstrate superior prediction performance for dynamic vehicles and generic dynamic scene elements compared to baseline methods.

116. 【2602.07931】Which private attributes do VLMs agree on and predict well?

链接https://arxiv.org/abs/2602.07931

作者:Olena Hrynenko,Darya Baranouskaya,Alina Elena Baia,Andrea Cavallaro

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Language Models, Language Models, Visual Language, Visual, detection of visual

备注: This work has been accepted to the ICASSP 2026

点击查看摘要

Abstract:Visual Language Models (VLMs) are often used for zero-shot detection of visual attributes in the image. We present a zero-shot evaluation of open-source VLMs for privacy-related attribute recognition. We identify the attributes for which VLMs exhibit strong inter-annotator agreement, and discuss the disagreement cases of human and VLM annotations. Our results show that when evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more often than human annotators. In addition to this, we find that in cases of high inter-annotator agreement between VLMs, they can complement human annotation by identifying attributes overlooked by human annotators. This highlights the potential of VLMs to support privacy annotations in large-scale image datasets.
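
The abstract does not say which agreement statistic is used, so the following is only a minimal sketch of one common choice, Cohen's kappa, for two annotators (for example two VLMs, or a VLM and a human) labelling the same images for one attribute. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: this is a generic agreement measure chosen for illustration;
    the paper may use a different statistic.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators always emit the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical presence (1) / absence (0) labels for one privacy attribute.
vlm_1 = [1, 0, 1, 0]
vlm_2 = [1, 0, 1, 0]
print(cohen_kappa(vlm_1, vlm_2))
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.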

117. 【2602.07919】Selective Fine-Tuning for Targeted and Robust Concept Unlearning

链接https://arxiv.org/abs/2602.07919

作者:Mansi,Avinash Kori,Francesca Toni,Soteris Demetriou

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Text guided diffusion, guided diffusion models, produce harmful content, Text guided, harmful content

备注: Given the brittle nature of existing methods in unlearning harmful content in diffusion models, we propose TRuST, a novel approach for dynamically estimating target concept neurons and unlearning them by selectively fine-tuning

点击查看摘要

Abstract:Text guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim at reducing the models' likelihood of generating harmful content. Traditionally, this has been tackled at an individual concept level, with only a handful of recent works considering more realistic concept combinations. However, state of the art methods depend on full finetuning, which is computationally expensive. Concept localisation methods can facilitate selective finetuning, but existing techniques are static, resulting in suboptimal utility. In order to tackle these challenges, we propose TRUST (Targeted Robust Selective fine Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective finetuning, empowered by a Hessian based regularization. We show experimentally, against a number of SOTA baselines, that TRUST is robust against adversarial prompts, preserves generation quality to a significant degree, and is also significantly faster than the SOTA. Our method achieves unlearning of not only individual concepts but also combinations of concepts and conditional concepts, without any specific regularization.

118. 【2602.07899】Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models

链接https://arxiv.org/abs/2602.07899

作者:Zhenhao Shang,Haizhao Jing,Guoting Wei,Haokui Zhang,Rong Xiao,Jianqing Gao,Peng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Post-training quantization, deploying large language, PTQ, primary approach, approach for deploying

备注

点击查看摘要

Abstract:Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning, and the quantized performance is often strongly affected by the calibration in PTQ. By contrast, in vision-language models (VLMs), substantial differences between visual and text tokens in their activation distributions and sensitivities to quantization error pose significant challenges for effective calibration during PTQ. In this work, we rethink what PTQ calibration should align with in VLMs and propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). Guided by gradient information, we design a token-level importance integration mechanism for quantization error, and use it to construct a token-level calibration set, enabling a more fine-grained calibration strategy. Furthermore, TLQ introduces a multi-GPU, quantization-exposed layer-wise calibration scheme. This scheme keeps the layer-wise calibration procedure consistent with the true quantized inference path and distributes the complex layer-wise calibration workload across multiple RTX3090 GPUs, thereby reducing reliance on the large memory of A100 GPUs. TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings, indicating its strong quantization stability. The code will be released publicly.

119. 【2602.07891】Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video

链接https://arxiv.org/abs/2602.07891

作者:Zihui Gao,Ke Liu,Donny Y. Chen,Duochao Shi,Guosheng Lin,Hao Chen,Chunhua Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Geometric foundation models, Geometric foundation, scarcity of diverse, Sparse Geometric Anchoring, progress is severely

备注

点击查看摘要

Abstract:Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
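
The Chamfer Distance reported above can be illustrated with a small sketch. Conventions differ between papers (squared versus unsquared distances, sum versus mean), and the abstract does not state which one is used, so the variant below (mean of unsquared nearest-neighbour distances in both directions) is an assumption, as is the function name `chamfer_distance`.

```python
import math


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets.

    p and q are non-empty lists of equal-dimension coordinate tuples.
    Assumed convention: mean unsquared nearest-neighbour distance,
    summed over both directions.
    """
    if not p or not q:
        raise ValueError("both point sets must be non-empty")
    # For each point in p, distance to its nearest neighbour in q.
    p_to_q = sum(min(math.dist(a, b) for b in q) for a in p) / len(p)
    # For each point in q, distance to its nearest neighbour in p.
    q_to_p = sum(min(math.dist(b, a) for a in p) for b in q) / len(q)
    return p_to_q + q_to_p


# Toy point clouds (hypothetical values).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
print(chamfer_distance(predicted, reference))
```

This brute-force version is quadratic in the number of points; real evaluations on dense reconstructions typically use a spatial index for the nearest-neighbour search.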

120. 【2602.07888】Research on a Camera Position Measurement Method based on a Parallel Perspective Error Transfer Model

链接https://arxiv.org/abs/2602.07888

作者:Ning Hu,Shuai Li,Jindong Tan

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Camera pose estimation, geometric computer vision, heterogeneous measurement noise, pose estimation, Camera pose

备注: 32 pages, 19 figures

点击查看摘要

Abstract:Camera pose estimation from sparse correspondences is a fundamental problem in geometric computer vision and remains particularly challenging in near-field scenarios, where strong perspective effects and heterogeneous measurement noise can significantly degrade the stability of analytic PnP solutions. In this paper, we present a geometric error propagation framework for camera pose estimation based on a parallel perspective approximation. By explicitly modeling how image measurement errors propagate through perspective geometry, we derive an error transfer model that characterizes the relationship between feature point distribution, camera depth, and pose estimation uncertainty. Building on this analysis, we develop a pose estimation method that leverages parallel perspective initialization and error-aware weighting within a Gauss-Newton optimization scheme, leading to improved robustness in proximity operations. Extensive experiments on both synthetic data and real-world images, covering diverse conditions such as strong illumination, surgical lighting, and underwater low-light environments, demonstrate that the proposed approach achieves accuracy and robustness comparable to state-of-the-art analytic and iterative PnP methods, while maintaining high computational efficiency. These results highlight the importance of explicit geometric error modeling for reliable camera pose estimation in challenging near-field settings.
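
To make the idea of error-aware weighting inside a Gauss-Newton loop concrete, here is a deliberately tiny sketch. It is not the paper's pose solver: it fits a single parameter `a` in the toy model y = exp(a * x), and the per-measurement weights are hypothetical stand-ins for the uncertainty-derived weights the abstract describes.

```python
import math


def weighted_gauss_newton(xs, ys, weights=None, a0=0.0, iters=100, tol=1e-12):
    """Fit y = exp(a * x) by weighted Gauss-Newton on the single parameter a.

    Toy illustration only: a real pose estimator optimizes six pose
    parameters against reprojection residuals.
    """
    if weights is None:
        weights = [1.0] * len(xs)
    a = a0
    for _ in range(iters):
        num = 0.0  # accumulates w * J * r
        den = 0.0  # accumulates w * J * J
        for x, y, w in zip(xs, ys, weights):
            pred = math.exp(a * x)
            r = pred - y   # residual of this measurement
            j = x * pred   # derivative of the residual with respect to a
            num += w * j * r
            den += w * j * j
        if den == 0.0:
            break  # no measurement constrains the parameter
        step = num / den
        a -= step  # Gauss-Newton update for a scalar parameter
        if abs(step) < tol:
            break
    return a


xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]
# Down-weight the last measurement, as one would for a less reliable feature.
print(weighted_gauss_newton(xs, ys, weights=[1.0, 1.0, 1.0, 0.1]))
```

Lower weights reduce the influence of measurements believed to be noisier, which is the general mechanism behind error-aware weighting.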

121. 【2602.07872】WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

链接https://arxiv.org/abs/2602.07872

作者:Mert Sonmezer,Serge Vasylechko,Duygu Atasoy,Seyda Ertekin,Sila Kurugol

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Retrieving wrist radiographs, clinically important cues, variable imaging views, analogous fracture patterns, Retrieving wrist

备注

点击查看摘要

Abstract:Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at this https URL.
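
The two-stage retrieval described above (coarse global matching, then region-conditioned reranking) can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout, the names `cosine` and `two_stage_retrieve`, the region key `distal_radius`, and the toy embeddings are all hypothetical, and cosine similarity is assumed as the matching score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def two_stage_retrieve(query, gallery, region, shortlist=3, top_k=1):
    """Coarse global shortlist followed by region-conditioned reranking."""
    # Stage 1: rank the whole gallery by global-embedding similarity.
    coarse = sorted(
        gallery,
        key=lambda item: cosine(query["global"], item["global"]),
        reverse=True,
    )[:shortlist]
    # Stage 2: rerank only the shortlist using the chosen bone region.
    reranked = sorted(
        coarse,
        key=lambda item: cosine(query["regions"][region], item["regions"][region]),
        reverse=True,
    )
    return [item["id"] for item in reranked[:top_k]]


query = {"global": [1.0, 0.0], "regions": {"distal_radius": [0.0, 1.0]}}
gallery = [
    {"id": "A", "global": [1.0, 0.1], "regions": {"distal_radius": [1.0, 0.0]}},
    {"id": "B", "global": [0.9, 0.2], "regions": {"distal_radius": [0.0, 1.0]}},
    {"id": "C", "global": [0.0, 1.0], "regions": {"distal_radius": [0.0, 1.0]}},
]
print(two_stage_retrieve(query, gallery, "distal_radius", shortlist=2))
```

In this toy gallery, exam C matches the query region perfectly but never reaches the reranking stage because its global embedding is dissimilar, which shows how the coarse stage bounds what the fine stage can return.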

122. 【2602.07864】Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

链接https://arxiv.org/abs/2602.07864

作者:Chen Yang,Guanxin Lin,Youquan He,Peiyao Chen,Guanghe Liu,Yufan Mo,Zhouyuan Xu,Linhao Wang,Guohui Zhang,Zihang Zhang,Shenxiang Zeng,Chen Wang,Jiansheng Fan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:evaluate largely unconstrained, largely unconstrained scenes, benchmarks evaluate largely, crucial for vision, intelligence is crucial

备注

点击查看摘要

Abstract:Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: this https URL.

123. 【2602.07860】Recovering 3D Shapes from Ultra-Fast Motion-Blurred Images

链接https://arxiv.org/abs/2602.07860

作者:Fei Yu,Shudan Guo,Shiqing Xin,Beibei Wang,Haisen Zhao,Wenzheng Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:ultra-fast motion-blurred images, motion-blurred images, ultra-fast motion-blurred, motion-blurred images remains, shape recovery

备注: Accepted by 3DV 2026. Project page: [this https URL](https://maxmilite.github.io/rec-from-ultrafast-blur/)

点击查看摘要

Abstract:We consider the problem of 3D shape recovery from ultra-fast motion-blurred images. While 3D reconstruction from static images has been extensively studied, recovering geometry from extreme motion-blurred images remains challenging. Such scenarios frequently occur in both natural and industrial settings, such as fast-moving objects in sports (e.g., balls) or rotating machinery, where rapid motion distorts object appearance and makes traditional 3D reconstruction techniques like Multi-View Stereo (MVS) ineffective. In this paper, we propose a novel inverse rendering approach for shape recovery from ultra-fast motion-blurred images. While conventional rendering techniques typically synthesize blur by averaging across multiple frames, we identify a major computational bottleneck in the repeated computation of barycentric weights. To address this, we propose a fast barycentric coordinate solver, which significantly reduces computational overhead and achieves a speedup of up to 4.57x, enabling efficient and photorealistic simulation of high-speed motion. Crucially, our method is fully differentiable, allowing gradients to propagate from rendered images to the underlying 3D shape, thereby facilitating shape recovery through inverse rendering. We validate our approach on two representative motion types: rapid translation and rotation. Experimental results demonstrate that our method enables efficient and realistic modeling of ultra-fast moving objects in the forward simulation. Moreover, it successfully recovers 3D shapes from 2D imagery of objects undergoing extreme translational and rotational motion, advancing the boundaries of vision-based 3D reconstruction. Project page: this https URL
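
The abstract's two ingredients, barycentric weights and blur synthesized by averaging over the exposure, can be sketched in a few lines. This uses the textbook barycentric formula, not the paper's fast solver, and the names `barycentric` and `blurred_coverage`, the constant-velocity motion model, and the sample count are illustrative assumptions.

```python
def barycentric(p, a, b, c):
    """Barycentric weights (wa, wb, wc) of 2D point p in triangle abc."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def blurred_coverage(p, tri, velocity, samples=8):
    """Fraction of the exposure during which pixel p lies inside a triangle
    translating at constant velocity (assumed motion model)."""
    hits = 0
    for k in range(samples):
        t = (k + 0.5) / samples  # midpoint of the k-th time slice in [0, 1]
        moved = [(x + velocity[0] * t, y + velocity[1] * t) for x, y in tri]
        # The weights are recomputed for every time sample; this repeated
        # work is the bottleneck the abstract refers to.
        weights = barycentric(p, *moved)
        hits += all(w >= 0 for w in weights)
    return hits / samples


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(blurred_coverage((0.5, 0.25), triangle, velocity=(4.0, 0.0)))
```

A static triangle gives a coverage of 1.0 for interior pixels, while fast motion spreads the same triangle thinly over many pixels, which is why naive per-frame averaging needs many samples.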

124. 【2602.07854】Geometry-Aware Rotary Position Embedding for Consistent Video World Model

链接https://arxiv.org/abs/2602.07854

作者:Chendong Xiang,Jiajun Liu,Jintao Zhang,Xiao Yang,Zhengwei Fang,Shizun Wang,Zijun Wang,Yingtian Zou,Hang Su,Jun Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Predictive world models, simulate future observations, Predictive world, explicit camera control, world models

备注

点击查看摘要

Abstract:Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
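
For background, the sketch below shows standard rotary position embedding (RoPE) over a scalar position index, not ViewRope itself. The abstract says ViewRope replaces screen-space positions with camera-ray geometry, but it does not give the parameterization, so none is attempted here. The function names and the toy vectors are illustrative assumptions.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    if len(vec) % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, len(vec), 2):
        # Each consecutive pair is rotated by an angle proportional to the
        # position, with a frequency that decays with the pair index.
        theta = position * base ** (-i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(u, v):
    """Plain dot product, standing in for an attention logit."""
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# The logit depends only on the relative offset (2 in both cases).
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

The two printed logits agree up to floating-point error because rotating query and key by their own positions leaves a dot product that depends only on the position difference. That relative property is what makes a rotary scheme a natural place to inject relative geometry.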

125. 【2602.07835】VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

链接https://arxiv.org/abs/2602.07835

作者:Sanoojan Baliah,Yohan Abeysinghe,Rusiru Thushara,Khan Muhammad,Abhinav Dhall,Karthik Nandakumar,Muhammad Haris Khan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-quality face swapping, face swapping, Frequency Spectrum Attention, Spectrum Attention Interpolation, present a training-free

备注

点击查看摘要

Abstract:We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at this https URL.

126. 【2602.07833】SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

链接https://arxiv.org/abs/2602.07833

作者:Weijiang Lv,Yaoxuan Feng,Xiaobo Xia,Jiayu Wang,Yan Jing,Wenchao Chen,Bo Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, multimodal large language, traces remains unclear, language models, remains unclear

备注: 53 pages, 42 figures, 14 tables

点击查看摘要

Abstract:Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at this https URL.

127. 【2602.07827】Open-Text Aerial Detection: A Unified Framework For Aerial Visual Grounding And Detection

Link: https://arxiv.org/abs/2602.07827

Authors: Guoting Wei, Xia Yuan, Yang Zhou, Haizhao Jing, Yu Liu, Xianbiao Qi, Chunxia Zhao, Haokui Zhang, Rong Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sensing Visual Grounding, Remote Sensing Visual, Visual Grounding, Remote Sensing, Sensing Visual

Comments: none

Abstract:Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms into a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several highly efficient modules. It achieves state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.

128. 【2602.07820】Back to Physics: Operator-Guided Generative Paths for SMS MRI Reconstruction

Link: https://arxiv.org/abs/2602.07820

Authors: Zhibo Chen, Yu Guan, Yajuan Huang, Chaoqi Chen, Xiang Ji, Qiuyun Fan, Dong Liang, Qiegen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: undersampling enables highly, enables highly accelerated, strongly coupled inverse, coupled inverse problem, highly accelerated MRI

Comments: 10 pages, 6 figures

Abstract:Simultaneous multi-slice (SMS) imaging with in-plane undersampling enables highly accelerated MRI but yields a strongly coupled inverse problem with deterministic inter-slice interference and missing k-space data. Most diffusion-based reconstructions are formulated around Gaussian-noise corruption and rely on additional consistency steps to incorporate SMS physics, which can be mismatched to the operator-governed degradations in SMS acquisition. We propose an operator-guided framework that models the degradation trajectory using known acquisition operators and inverts this process via deterministic updates. Within this framework, we introduce an operator-conditional dual-stream interaction network (OCDI-Net) that explicitly disentangles target-slice content from inter-slice interference and predicts structured degradations for operator-aligned inversion, and we instantiate reconstruction as a two-stage chained inference procedure that performs SMS slice separation followed by in-plane completion. Experiments on fastMRI brain data and prospectively acquired in vivo diffusion MRI data demonstrate improved fidelity and reduced slice leakage over conventional and learning-based SMS reconstructions.

129. 【2602.07815】Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures

Link: https://arxiv.org/abs/2602.07815

Authors: Simiao Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systematically compared modern, compared modern vision-language, Facial age estimation, modern vision-language models, Facial age

Comments: none

Abstract:Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across 8 standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average MAE of 5.65 years compared to 9.88 for non-LLM models. The best VLM (Gemini 3 Flash Preview, MAE 4.32) outperforms the best non-LLM model (MiVOLO, MAE 5.10) by 15%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60-100% false adult rates on minors while VLMs achieve 13-25%, and demonstrate that coarse age binning (8-9 classes) consistently degrades MAE beyond 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (<5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect toward distilling VLM capabilities into efficient specialized models.
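
The two headline metrics in this abstract, mean absolute error (MAE) and the false adult rate at the 18-year threshold, can be illustrated with a minimal sketch. The function names and the toy ages below are hypothetical and are not taken from the paper; the false adult rate is assumed here to mean the share of true minors whose predicted age is 18 or above.

```python
from statistics import mean


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute gap, in years, between true and predicted ages."""
    return mean(abs(t - p) for t, p in zip(true_ages, predicted_ages))


def false_adult_rate(true_ages, predicted_ages, threshold=18):
    """Share of true minors whose predicted age is at or above the threshold.

    This definition is an assumption made for illustration.
    """
    minors = [(t, p) for t, p in zip(true_ages, predicted_ages) if t < threshold]
    if not minors:
        raise ValueError("no minors in the sample")
    return sum(p >= threshold for _, p in minors) / len(minors)


# Toy, made-up data (not from the paper).
true_ages = [12, 16, 17, 25, 40]
predicted_ages = [15, 19, 16, 22, 46]

print(mean_absolute_error(true_ages, predicted_ages))  # 3.2
print(false_adult_rate(true_ages, predicted_ages))     # one of three minors
```

A low MAE does not guarantee a low false adult rate, since the latter depends only on errors that cross the threshold.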

130. 【2602.07814】How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

Link: https://arxiv.org/abs/2602.07814

Authors: Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai (Neo) Tiangratanakul, Tsang (Dennis) Ng, En Wei, Jiayu Xue

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: maintaining content authenticity, reliable detection methods, AI-generated images proliferate, detection methods, digital platforms

Comments: none

Abstract:As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance, the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets totaling 2.6 million image samples and spanning 291 unique generators, including modern diffusion models. Our systematic analysis reveals striking findings: (1) no universal winner exists, with detector rankings exhibiting substantial instability (Spearman $\rho$: 0.01 to 0.87 across dataset pairs); (2) a 37 percentage-point performance gap separates the best detector (75.0% mean accuracy) from the worst (37.5%); (3) training data alignment critically impacts generalization, causing up to 20-60% performance variance within architecturally identical detector families; (4) modern commercial generators (Flux Dev, Firefly v4, Midjourney v7) defeat most detectors, which achieve only 18-30% average accuracy on them; and (5) we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $\chi^2$=121.01, $p<10^{-16}$, Kendall $W$=0.524). Our findings challenge the "one-size-fits-all" detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.
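
The ranking instability reported above is measured with Spearman's rank correlation between detector rankings on pairs of datasets. The following is a minimal sketch of that statistic using the classic no-ties formula; the accuracy values are invented for illustration and the helper names are hypothetical, not the paper's code.

```python
def ranks(scores):
    """Rank scores from 1 (highest) downward; assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(scores_a, scores_b):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equally long lists with at least two scores")
    d_squared = sum(
        (ra - rb) ** 2 for ra, rb in zip(ranks(scores_a), ranks(scores_b))
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up accuracies of four detectors on two datasets (not from the paper).
accuracy_on_a = [0.90, 0.80, 0.70, 0.60]
accuracy_on_b = [0.55, 0.75, 0.65, 0.85]

print(spearman_rho(accuracy_on_a, accuracy_on_b))  # close to -0.8: rankings mostly reversed
```

A value near 1 means the two datasets rank the detectors the same way; values near 0 or below mean that the best detector on one dataset says little about the best on the other.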

131. 【2602.07801】VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Link: https://arxiv.org/abs/2602.07801

Authors: Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: conventional uniform frame, key visual evidence, capture key visual, uniform frame sampling, conventional uniform

Comments: none

Abstract:In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.

132. 【2602.07784】Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing

Link: https://arxiv.org/abs/2602.07784

Authors: Jayawant Bodagala, Balaji Bodagala

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains limited due, traffic signal control, adaptive traffic signal, Real-world deployment, non-interpretable control policies

Comments: Total pages: 9

Abstract:Real-world deployment of adaptive traffic signal control, to date, remains limited due to the uncertainty associated with vision-based perception, implicit safety, and non-interpretable control policies learned and validated mainly in simulation. In this paper, we introduce UCATSC, a model-based traffic signal control system that models traffic signal control at an intersection using a stochastic decision process with constraints and under partial observability, taking into account the uncertainty associated with vision-based perception. Unlike reinforcement learning methods that learn to predict safety using reward shaping, UCATSC predicts and enforces hard constraints related to safety and starvation prevention during counterfactual rollouts in belief space. The system is designed to improve traffic delay and emission while preventing safety-critical errors and providing interpretable control policy outputs based on explicit models.

133. 【2602.07775】Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Link: https://arxiv.org/abs/2602.07775

Authors: Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable performance, video diffusion models, Rolling Sink, remarkable performance, diffusion models

Comments: Figure PDFs were compressed to 150 dpi to comply with arXiv's submission size limit. Project page: [this https URL](https://rolling-sink.github.io/)

Abstract:Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: this https URL

134. 【2602.07768】PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

Link: https://arxiv.org/abs/2602.07768

Authors: Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Fine-Grained Visual Classification, large Vision-Language Models, Visual Classification, Distilling knowledge, Vision-Language Models

Comments: 6 pages, 3 figures, conference

Abstract:Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at this https URL.

135. 【2602.07736】Global Symmetry and Orthogonal Transformations from Geometrical Moment $n$-tuples

Link: https://arxiv.org/abs/2602.07736

Authors: Omar Tahri

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: effective object grasping, crucial for effective, effective object, orthogonal transformations, transformations

Comments: none

Abstract:Detecting symmetry is crucial for effective object grasping for several reasons. Recognizing symmetrical features or axes within an object helps in developing efficient grasp strategies, as grasping along these axes typically results in a more stable and balanced grip, thereby facilitating successful manipulation. This paper employs geometrical moments to identify symmetries and estimate orthogonal transformations, including rotations and mirror transformations, for objects centered at the frame origin. It provides distinctive metrics for detecting symmetries and estimating orthogonal transformations, encompassing rotations, reflections, and their combinations. A comprehensive methodology is developed to obtain these functions in $n$-dimensional space, specifically moment $n$-tuples. Extensive validation tests are conducted on both 2D and 3D objects to ensure the robustness and reliability of the proposed approach. The proposed method is also compared to state-of-the-art work using iterative optimization for detecting multiple planes of symmetry. The results indicate that combining our method with the iterative one yields satisfactory outcomes in terms of the number of symmetry planes detected and computation time.
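
As background for why geometrical moments can reveal symmetry, here is a minimal 2D sketch of one well-known property: if a point set is mirror-symmetric about the x-axis, every moment with an odd power of y vanishes. This is only a necessary-condition check up to a finite order, with hypothetical function names and toy points; it is not the moment $n$-tuple method of the paper.

```python
def moment(points, p, q):
    """Geometrical moment m_pq: the sum of x**p * y**q over a 2D point set."""
    return sum((x ** p) * (y ** q) for x, y in points)


def passes_x_axis_mirror_test(points, max_order=4, tol=1e-9):
    """True if every moment with odd q, up to max_order, is (near) zero.

    Vanishing odd-q moments are necessary for mirror symmetry about the
    x-axis; checking a finite order is not a proof of symmetry.
    """
    for order in range(1, max_order + 1):
        for q in range(1, order + 1, 2):  # odd powers of y only
            if abs(moment(points, order - q, q)) > tol:
                return False
    return True


# Toy point sets, both centered at the origin (made up for illustration).
symmetric = [(2, 1), (2, -1), (-2, 3), (-2, -3)]
lopsided = [(1, 2), (1, -1), (-2, 1), (0, -2)]

print(passes_x_axis_mirror_test(symmetric))  # True
print(passes_x_axis_mirror_test(lopsided))   # False, since m_11 is nonzero
```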

136. 【2602.07717】All-Optical Segmentation via Diffractive Neural Networks for Autonomous Driving

Link: https://arxiv.org/abs/2602.07717

Authors: Yingjie Li, Daniel Robinson, Cunxi Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: Semantic segmentation, crucial tasks, Semantic, neural networks, tasks in autonomous

Comments: none

Abstract:Semantic segmentation and lane detection are crucial tasks in autonomous driving systems. Conventional approaches predominantly rely on deep neural networks (DNNs), which incur high energy costs due to extensive analog-to-digital conversions and large-scale image computations required for low-latency, real-time responses. Diffractive optical neural networks (DONNs) have shown promising advantages over conventional DNNs on digital or optoelectronic computing platforms in energy efficiency. By performing all-optical image processing via light diffraction at the speed of light, DONNs save computation energy costs while reducing the overhead associated with analog-to-digital conversions by all-optical encoding and computing. In this work, we propose a novel all-optical computing framework for RGB image segmentation and lane detection in autonomous driving applications. Our experimental results demonstrate the effectiveness of the DONN system for image segmentation on the CityScapes dataset. Additionally, we conduct case studies on lane detection using a customized indoor track dataset and simulated driving scenarios in CARLA, where we further evaluate the model's generalizability under diverse environmental conditions.

137. 【2602.07702】A hybrid Kolmogorov-Arnold network for medical image segmentation

Link: https://arxiv.org/abs/2602.07702

Authors: Deep Bhattacharyya, Ali Ayub, A. Ben Hamza

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, capturing non-linear relationships, image segmentation plays, KAN Bernstein Spline, treatment planning

Comments: none

Abstract:Medical image segmentation plays a vital role in diagnosis and treatment planning, but remains challenging due to the inherent complexity and variability of medical images, especially in capturing non-linear relationships within the data. We propose U-KABS, a novel hybrid framework that integrates the expressive power of Kolmogorov-Arnold Networks (KANs) with a U-shaped encoder-decoder architecture to enhance segmentation performance. The U-KABS model combines a convolutional and squeeze-and-excitation stage, which enhances channel-wise feature representations, with a KAN Bernstein Spline (KABS) stage, which employs learnable activation functions based on Bernstein polynomials and B-splines. This hybrid design leverages the global smoothness of Bernstein polynomials and the local adaptability of B-splines, enabling the model to effectively capture both broad contextual trends and fine-grained patterns critical for delineating complex structures in medical images. Skip connections between encoder and decoder layers support effective multi-scale feature fusion and preserve spatial details. Evaluated across diverse medical imaging benchmark datasets, U-KABS demonstrates superior performance compared to strong baselines, particularly in segmenting complex anatomical structures.

138. 【2602.07694】Semantic-Deviation-Anchored Multi-Branch Fusion for Unsupervised Anomaly Detection and Localization in Unstructured Conveyor-Belt Coal Scenes

Link: https://arxiv.org/abs/2602.07694

Authors: Wenping Jin, Yuyang Tang, Li Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intelligent mining operations, Reliable foreign-object anomaly, conveyor-belt coal scenes, foreign-object anomaly detection, Reliable foreign-object

Comments: none

Abstract:Reliable foreign-object anomaly detection and pixel-level localization in conveyor-belt coal scenes are essential for safe and intelligent mining operations. This task is particularly challenging due to the highly unstructured environment: coal and gangue are randomly piled, backgrounds are complex and variable, and foreign objects often exhibit low contrast, deformation, and occlusion, which couples them with their surroundings. These characteristics weaken the stability and regularity assumptions that many anomaly detection methods rely on in structured industrial settings, leading to notable performance degradation. To support evaluation and comparison in this setting, we construct CoalAD, a benchmark for unsupervised foreign-object anomaly detection with pixel-level localization in coal-stream scenes. We further propose a complementary-cue collaborative perception framework that extracts and fuses complementary anomaly evidence from three perspectives: object-level semantic composition modeling, semantic-attribution-based global deviation analysis, and fine-grained texture matching. The fused outputs provide robust image-level anomaly scoring and accurate pixel-level localization. Experiments on CoalAD demonstrate that our method outperforms widely used baselines across the evaluated image-level and pixel-level metrics, and ablation studies validate the contribution of each component. The code is available at this https URL.

139. 【2602.07689】Process-of-Thought Reasoning for Videos

Link: https://arxiv.org/abs/2602.07689

Authors: Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: performing temporally grounded, recognizing visual content, Video understanding requires, temporally grounded, noisy observations

Comments: none

Abstract:Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.

140. 【2602.07680】Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Link: https://arxiv.org/abs/2602.07680

Authors: Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: powerful representation learning, align visual observations, offering new opportunities, recently emerged, emerged as powerful

Comments: none

Abstract:Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
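
The first use case, a hazard signal derived from image-text similarity, can be sketched in a few lines. The code below assumes the image and the text prompts have already been embedded into a shared space by some CLIP-style encoder; the scoring rule (best hazard-prompt similarity minus best safe-prompt similarity), the function names, and the toy vectors are all hypothetical and are not the paper's exact formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return dot / norm


def hazard_signal(image_embedding, hazard_prompt_embeddings, safe_prompt_embeddings):
    """Best hazard-prompt similarity minus best safe-prompt similarity.

    Positive values mean the image sits closer to a hazard description.
    This scoring rule is an illustrative assumption.
    """
    hazard = max(cosine_similarity(image_embedding, e) for e in hazard_prompt_embeddings)
    safe = max(cosine_similarity(image_embedding, e) for e in safe_prompt_embeddings)
    return hazard - safe


# Toy 2D vectors standing in for real encoder outputs.
image = [0.9, 0.1]
hazard_prompts = [[1.0, 0.0], [0.8, 0.6]]
safe_prompts = [[0.0, 1.0]]

print(hazard_signal(image, hazard_prompts, safe_prompts) > 0)  # True
```

Because the score only needs one image embedding and a fixed set of precomputed prompt embeddings, it stays cheap at inference time, which fits the low-latency goal stated in the abstract.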

141. 【2602.07668】Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Link: https://arxiv.org/abs/2602.07668

Authors: Ross Greer, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Giovanni Tapia Lopez, Angel Martinez-Sanchez, Parthib Roy, Jake Rattigan, Mira Sur, Alejandra Vidrio, Thomas Marcotte, Mohan Trivedi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: smart airbag deployment, takeover time prediction, autonomous control transitions, driver attention monitoring, enabled intelligent vehicle

Comments: none

Abstract:The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

142. 【2602.07658】Influence of Geometry, Class Imbalance and Alignment on Reconstruction Accuracy -- A Micro-CT Phantom-Based Evaluation

Link: https://arxiv.org/abs/2602.07658

Authors: Avinash Kumar K M, Samarth S. Raut

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical scans depends, mesh processing techniques, imaging hardware, created from medical, medical scans

Comments: 22 pages, 13 figures

Abstract:The accuracy of 3D models created from medical scans depends on imaging hardware, segmentation methods, mesh processing techniques, and other factors. The effects of geometry type, class imbalance, and voxel and point cloud alignment on accuracy remain to be thoroughly explored. This work evaluates the errors across the reconstruction pipeline and explores the use of voxel-based and surface-based accuracy metrics for different segmentation algorithms and geometry types. A sphere, a facemask, and an AAA were printed using the SLA technique and scanned using a micro-CT machine. Segmentation was performed using GMM, Otsu, and RG based methods. Segmented and reference models, aligned using the KU algorithm, were quantitatively compared using metrics such as Dice score, Jaccard score, and precision. Surface meshes were registered with reference meshes using an ICP-based alignment process, and metrics such as chamfer distance and average Hausdorff distance were evaluated. The Otsu method was found to be the most suitable method for all the geometries. AAA yielded low overlap scores due to its small wall thickness and misalignment. The effect of class imbalance on specificity was observed the most for AAA. Surface-based accuracy metrics differed from the voxel-based trends. The RG method performed best for the sphere, while GMM and Otsu performed better for AAA. The facemask surface was most error-prone, possibly due to misalignment during the ICP process. Segmentation accuracy is a cumulative sum of errors across different stages of the reconstruction process. High voxel-based accuracy metrics may be misleading in cases of high class imbalance and sensitivity to alignment. The Jaccard index is found to be more stringent than the Dice score and more suitable for accuracy assessment of thin-walled structures. Voxel and point cloud alignment should be ensured to make any reliable assessment of the reconstruction pipeline.
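
The remark that the Jaccard index is more stringent than Dice follows from the standard identity J = D / (2 - D), which gives J ≤ D for any overlap. A minimal sketch on toy voxel index sets (invented here, not the paper's data) shows how a small misalignment of a thin structure is penalized more heavily by Jaccard.

```python
def dice(a, b):
    """Dice score 2|A∩B| / (|A| + |B|) between two sets of voxel indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index |A∩B| / |A∪B| between two sets of voxel indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy thin structure: 10 voxels, with the segmentation shifted by 2 voxels.
reference = range(0, 10)
segmented = range(2, 12)

d = dice(reference, segmented)
j = jaccard(reference, segmented)
print(d, j)  # Dice is 0.8, Jaccard is about 0.667
print(abs(j - d / (2 - d)) < 1e-12)  # True: the identity J = D / (2 - D) holds
```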

143. 【2602.07645】From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding

Link: https://arxiv.org/abs/2602.07645

Authors: Leonardo Gonzalez

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Google Slides slide, batch update API, editable Google Slides, data visualizations, reuse expensive

Comments: Accepted for publication in the Companion Proceedings of the ACM Web Conference 2026 (WWW Companion '26), April 13-17, 2026, Dubai, United Arab Emirates

Abstract:Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe Images2Slides, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, Images2Slides achieves an overall element recovery rate of $0.989\pm0.057$ (text: $0.985\pm0.083$, images: $1.000\pm0.000$), with mean text transcription error $\mathrm{CER}=0.033\pm0.149$ and mean layout fidelity $\mathrm{IoU}=0.364\pm0.161$ for text regions and $0.644\pm0.131$ for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
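
Two steps named in the abstract, mapping pixel geometry into slide coordinates and scoring layout fidelity with IoU, can be sketched as follows. The sketch assumes boxes are `(x, y, width, height)` tuples, that slide coordinates are expressed in EMU (914,400 per inch), that the page is a 10 in by 5.625 in widescreen slide, and that the image is scaled uniformly and centered; none of these choices, nor the function names, are confirmed by the paper.

```python
EMU_PER_INCH = 914_400
# Assumed page size: 10 in x 5.625 in, expressed in EMU.
DEFAULT_SLIDE_SIZE = (10 * EMU_PER_INCH, 5_143_500)


def pixel_box_to_slide(box, image_size, slide_size=DEFAULT_SLIDE_SIZE):
    """Scale an (x, y, w, h) pixel box into slide units.

    The image is scaled uniformly to fit the slide and centered on it.
    """
    x, y, w, h = box
    image_w, image_h = image_size
    slide_w, slide_h = slide_size
    scale = min(slide_w / image_w, slide_h / image_h)
    offset_x = (slide_w - image_w * scale) / 2
    offset_y = (slide_h - image_h * scale) / 2
    return (
        round(offset_x + x * scale),
        round(offset_y + y * scale),
        round(w * scale),
        round(h * scale),
    )


def box_iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


# A 1600x900 infographic with one region in the upper-left quadrant.
print(pixel_box_to_slide((160, 90, 800, 450), (1600, 900)))
# (914400, 514350, 4572000, 2571750)
print(box_iou((0, 0, 10, 10), (5, 0, 10, 10)))  # one third
```

Uniform scaling keeps element aspect ratios intact; an image whose aspect ratio differs from the slide's would be letterboxed rather than stretched under this assumption.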

144. 【2602.07643】Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation

Link: https://arxiv.org/abs/2602.07643

Authors: Yichi Zhang, Feiyang Xiao, Le Xue, Wenbo Zhang, Gang Feng, Chenguang Zheng, Yuan Qi, Yuan Cheng, Zixin Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: validation remains largely, remains largely confined, offer general-purpose capabilities, whole-body PET, medical foundation models

Comments: none

Abstract: While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans ($\sim$675k 2D images, $\sim$12k 3D organ annotations) and conduct a comprehensive evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundational cornerstone for future research to develop truly modality-agnostic medical foundation models.

145. 【2602.07625】AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning

Link: https://arxiv.org/abs/2602.07625

Authors: Binxiao Xu, Junyu Feng, Xiaopeng Lin, Haodong Li, Zhiyuan Feng, Bohan Zeng, Shaolin Lu, Ming Lu, Qi She, Wentao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: essential for interpreting, interpreting the intricate, intricate relationship, Multimodal understanding, Multimodal

Abstract:Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at this https URL.

146. 【2602.07608】HistoMet: A Pan-Cancer Deep Learning Framework for Prognostic Prediction of Metastatic Progression and Site Tropism from Primary Tumor Histopathology

Link: https://arxiv.org/abs/2602.07608

Authors: Yixin Chen, Ziyu Su, Lingbin Meng, Elshad Hasanov, Wei Chen, Anil Parwani, M. Khalid Khan Niazi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Metastatic Progression remains, primary tumor, Metastatic, cancer-related mortality, fundamental challenge

Abstract: Metastatic progression remains the leading cause of cancer-related mortality, yet predicting whether a primary tumor will metastasize and where it will disseminate directly from histopathology remains a fundamental challenge. Although whole-slide images (WSIs) provide rich morphological information, prior computational pathology approaches typically address metastatic status or site prediction as isolated tasks, and do not explicitly model the clinically sequential decision process of metastatic risk assessment followed by downstream site-specific evaluation. To address this research gap, we present a decision-aware, concept-aligned MIL framework, HistoMet, for prognostic metastatic outcome prediction from primary tumor WSIs. Our proposed framework adopts a two-module prediction pipeline in which the likelihood of metastatic progression from the primary tumor is first estimated, followed by conditional prediction of metastatic site for high-risk cases. To guide representation learning and improve clinical interpretability, our framework integrates linguistically defined and data-adaptive metastatic concepts through a pretrained pathology vision-language model. We evaluate HistoMet on a multi-institutional pan-cancer cohort of 6504 patients with metastasis follow-up and site annotations. Under clinically relevant high-sensitivity screening settings (95% sensitivity), HistoMet significantly reduces downstream workload while maintaining high metastatic risk recall. Conditional on metastatic cases, HistoMet achieves a macro F1 of 74.6 with a standard deviation of 1.3 and a macro one-vs-rest AUC of 92.1. These results demonstrate that explicitly modeling clinical decision structure enables robust and deployable prognostic prediction of metastatic progression and site tropism directly from primary tumor histopathology.
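The two-module decision flow described above (a risk estimate first, then site prediction only for high-risk cases, at a fixed screening sensitivity) can be sketched generically as follows. This is an illustrative sketch only, not HistoMet's code: the function names, the threshold-selection rule, and the site labels used in the usage note are assumptions.

```python
import math


def threshold_for_sensitivity(scores, labels, target=0.95):
    """Largest threshold t such that flagging score >= t keeps at least
    `target` sensitivity on the given labelled scores (1 = metastatic)."""
    positives = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("need at least one positive case")
    # Number of positive cases that must be flagged to reach the target.
    k = math.ceil(target * len(positives))
    return positives[k - 1]


def two_stage_predict(risk, site_probs, threshold):
    """Stage 1: screen by risk. Stage 2: pick a site for high-risk cases.

    Returns None for cases screened out as low risk.
    """
    if risk < threshold:
        return None
    return max(site_probs, key=site_probs.get)
```

For example, with validation scores in hand, `t = threshold_for_sensitivity(scores, labels)` fixes the operating point, and `two_stage_predict(0.9, {"lung": 0.2, "liver": 0.7, "bone": 0.1}, t)` returns the most probable (hypothetical) site only when the case clears the screen.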

147. 【2602.07605】Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Link: https://arxiv.org/abs/2602.07605

Authors: Hulingxiao He, Zijun Geng, Yuxin Peng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: hierarchically grouped based, Multi-modal Large Language, Fine-Grained Visual Recognition, Large Language Models, contrastive CLIP models

Comments: Published as a conference paper at ICLR 2026. The models are available at [this https URL](https://huggingface.co/collections/StevenHH2000/fine-r1)

Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise for knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at this https URL.

148. 【2602.07595】Boost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation

Link: https://arxiv.org/abs/2602.07595

Authors: Yuanzhi Liang, Xuan'er Wu, Yirui Liu, Yijie Fang, Yizhen Fan, Ke Hao, Rui Li, Ruiying Liu, Ziqi Ni, Peng Yu, Yanbo Wang, Haibin Huang, Qizhen Weng, Chi Zhang, Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pretrained video generator, long temporal horizons, decisive step, step for converting, converting a pretrained

Abstract: Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.

149. 【2602.07590】Automated rock joint trace mapping using a supervised learning model trained on synthetic data generated by parametric modelling

Link: https://arxiv.org/abs/2602.07590

Authors: Jessica Ka Yi Chiu, Tom Frode Hansen, Eivind Magnus Paulsen, Ole Jakob Mengshoel

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: geology-driven machine learning, machine learning method, paper presents, presents a geology-driven, geology-driven machine

Comments: 35 pages, 12 figures, 2 appendices

Abstract: This paper presents a geology-driven machine learning method for automated rock joint trace mapping from images. The approach combines geological modelling, synthetic data generation, and supervised image segmentation to address limited real data and class imbalance. First, discrete fracture network models are used to generate synthetic jointed rock images at field-relevant scales via parametric modelling, preserving joint persistence, connectivity, and node-type distributions. Second, segmentation models are trained using either mixed training or pretraining followed by fine-tuning on real images. The method is tested in box and slope domains using several real datasets. The results show that synthetic data can support supervised joint trace detection when real data are scarce. Mixed training performs well when real labels are consistent (e.g. box-domain), while fine-tuning is more robust when labels are noisy (e.g. slope-domain, where labels can be biased, incomplete, and inconsistent). Fully zero-shot prediction from the synthetic-only model remains limited, but useful generalisation is achieved by fine-tuning with a small amount of real data. Qualitative analysis shows clearer and more geologically meaningful joint traces than indicated by quantitative metrics alone. The proposed method supports reliable joint mapping and provides a basis for further work on domain adaptation and evaluation.

150. 【2602.07574】ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Link: https://arxiv.org/abs/2602.07574

Authors: Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Modern multimodal large, incurring substantial computational, large language models, unified self-attention design, Transformer layer

Abstract:Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at this https URL.

151. 【2602.07568】Visualizing the Invisible: Enhancing Radiologist Performance in Breast Mammography via Task-Driven Chromatic Encoding

Link: https://arxiv.org/abs/2602.07568

Authors: Hui Ye, Shilong Yang, Yexuan Xing, Juan Yu, Yaoqin Xie, Wei Zhang, Chulong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: subtle findings increase, increase perceptual difficulty, findings increase perceptual, tissue overlap, overlap and subtle

Abstract: Purpose: Mammography screening is less sensitive in dense breasts, where tissue overlap and subtle findings increase perceptual difficulty. We present MammoColor, an end-to-end framework with a Task-Driven Chromatic Encoding (TDCE) module that converts single-channel mammograms into TDCE-encoded views for visual augmentation. Materials and Methods: MammoColor couples a lightweight TDCE module with a BI-RADS triage classifier and was trained end-to-end on VinDr-Mammo. Performance was evaluated on an internal test set, two public datasets (CBIS-DDSM and INBreast), and three external clinical cohorts. We also conducted a multi-reader, multi-case (MRMC) observer study with a washout period, comparing (1) grayscale-only, (2) TDCE-only, and (3) side-by-side grayscale+TDCE. Results: On VinDr-Mammo, MammoColor improved AUC from 0.7669 to 0.8461 (P=0.004). Gains were larger in dense breasts (AUC 0.749 to 0.835). In the MRMC study, TDCE-encoded images improved specificity (0.90 to 0.96; P=0.052) with comparable sensitivity. Conclusion: TDCE provides a task-optimized chromatic representation that may improve perceptual salience and reduce false-positive recalls in mammography triage.

152. 【2602.07566】Cross-Camera Cow Identification via Disentangled Representation Learning

Link: https://arxiv.org/abs/2602.07566

Authors: Runcheng Wang, Yaru Chen, Guiguo Zhang, Honghua Jiang, Yongliang Qiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: comprehensive digital management, fundamental prerequisite, prerequisite for comprehensive, comprehensive digital, digital management

Abstract:Precise identification of individual cows is a fundamental prerequisite for comprehensive digital management in smart livestock farming. While existing animal identification methods excel in controlled, single-camera settings, they face severe challenges regarding cross-camera generalization. When models trained on source cameras are deployed to new monitoring nodes characterized by divergent illumination, backgrounds, viewpoints, and heterogeneous imaging properties, recognition performance often degrades dramatically. This limits the large-scale application of non-contact technologies in dynamic, real-world farming environments. To address this challenge, this study proposes a cross-camera cow identification framework based on disentangled representation learning. This framework leverages the Subspace Identifiability Guarantee (SIG) theory in the context of bovine visual recognition. By modeling the underlying physical data generation process, we designed a principle-driven feature disentanglement module that decomposes observed images into multiple orthogonal latent subspaces. This mechanism effectively isolates stable, identity-related biometric features that remain invariant across cameras, thereby substantially improving generalization to unseen cameras. We constructed a high-quality dataset spanning five distinct camera nodes, covering heterogeneous acquisition devices and complex variations in lighting and angles. Extensive experiments across seven cross-camera tasks demonstrate that the proposed method achieves an average accuracy of 86.0%, significantly outperforming the Source-only Baseline (51.9%) and the strongest cross-camera baseline method (79.8%). This work establishes a subspace-theoretic feature disentanglement framework for collaborative cross-camera cow identification, offering a new paradigm for precise animal monitoring in uncontrolled smart farming environments.

153. 【2602.07565】Human Identification at a Distance: Challenges, Methods and Results on the Competition HID 2025

Link: https://arxiv.org/abs/2602.07565

Authors: Jingzhe Ma, Meng Zhang, Jianlong Yu, Kun Liu, Zunxiao Xu, Xue Cheng, Junjie Zhou, Yanfei Wang, Jiahang Li, Zepeng Wang, Kazuki Osamura, Rujie Liu, Narishige Abe, Jingjie Wang, Shunli Zhang, Haojun Xie, Jiajun Wu, Weiming Wu, Wenxiong Kang, Qingshuo Gao, Jiaming Xiong, Xianye Ben, Lei Chen, Lichen Song, Junjian Cui, Haijun Xiong, Junhao Lu, Bin Feng, Mengyuan Liu, Ji Zhou, Baoquan Zhao, Ke Xu, Yongzhen Huang, Liang Wang, Manuel J Marin-Jimenez, Md Atiqur Rahman Ahad, Shiqi Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: traditional biometric modalities, Human identification, real-world scenarios, HID, traditional biometric

Comments: Accepted by IJCB 2025 ([this https URL](https://ijcb2025.ieee-biometrics.org/competitions/))

Abstract:Human identification at a distance (HID) is challenging because traditional biometric modalities such as face and fingerprints are often difficult to acquire in real-world scenarios. Gait recognition provides a practical alternative, as it can be captured reliably at a distance. To promote progress in gait recognition and provide a fair evaluation platform, the International Competition on Human Identification at a Distance (HID) has been organized annually since 2020. Since 2023, the competition has adopted the challenging SUSTech-Competition dataset, which features substantial variations in clothing, carried objects, and view angles. No dedicated training data are provided, requiring participants to train their models using external datasets. Each year, the competition applies a different random seed to generate distinct evaluation splits, which reduces the risk of overfitting and supports a fair assessment of cross-domain generalization. While HID 2023 and HID 2024 already used this dataset, HID 2025 explicitly examined whether algorithmic advances could surpass the accuracy limits observed previously. Despite the heightened difficulty, participants achieved further improvements, and the best-performing method reached 94.2% accuracy, setting a new benchmark on this dataset. We also analyze key technical trends and outline potential directions for future research in gait recognition.

154. 【2602.07564】SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

Link: https://arxiv.org/abs/2602.07564

Authors: Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paired image-edit data, Recent unified models, effectively align multiple, single diffusion transformer, Recent unified

Abstract:Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.

155. 【2602.07555】VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation

Link: https://arxiv.org/abs/2602.07555

Authors: Francesco Taioli, Shiping Yang, Sonia Raychaudhuri, Marco Cristani, Unnat Jain, Angel X Chang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Language-driven object navigation, interpret natural language, natural language descriptions, object navigation requires, Language-driven object

Abstract: Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To address these limitations, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset will be made available upon acceptance.

156. 【2602.07554】FlexID: Training-Free Flexible Identity Injection via Intent-Aware Modulation for Text-to-Image Generation

Link: https://arxiv.org/abs/2602.07554

Authors: Guandong Li, Yijun Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: seamlessly integrate specific, integrate specific identities, aims to seamlessly, seamlessly integrate, integrate specific

Abstract:Personalized text-to-image generation aims to seamlessly integrate specific identities into textual descriptions. However, existing training-free methods often rely on rigid visual feature injection, creating a conflict between identity fidelity and textual adaptability. To address this, we propose FlexID, a novel training-free framework utilizing intent-aware modulation. FlexID orthogonally decouples identity into two dimensions: a Semantic Identity Projector (SIP) that injects high-level priors into the language space, and a Visual Feature Anchor (VFA) that ensures structural fidelity within the latent space. Crucially, we introduce a Context-Aware Adaptive Gating (CAG) mechanism that dynamically modulates the weights of these streams based on editing intent and diffusion timesteps. By automatically relaxing rigid visual constraints when strong editing intent is detected, CAG achieves synergy between identity preservation and semantic variation. Extensive experiments on IBench demonstrate that FlexID achieves a state-of-the-art balance between identity consistency and text adherence, offering an efficient solution for complex narrative generation.

157. 【2602.07550】Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation

Link: https://arxiv.org/abs/2602.07550

Authors: Hussni Mohd Zakir, Eric Tatt Wei Ho

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent self-supervised Vision, self-supervised Vision Transformers, dense vision tasks, Vision Transformers, Recent self-supervised

Comments: 10 pages, 3 figures, 7 tables

Abstract: Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at this https URL.
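As a rough illustration of what a training-free, prototype-based segmenter does with frozen patch features, the sketch below averages support features under a mask and labels query patches by cosine similarity. It is a generic simplification, not FSSDINO itself: the Gram-matrix refinement is omitted, the use of an explicit background prototype is an assumption, and all function names are hypothetical.

```python
import math


def masked_mean(features, mask):
    """Class prototype: mean of the patch features where mask is truthy."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def segment(query_features, fg_proto, bg_proto):
    """Label a query patch 1 if it is closer to the foreground prototype."""
    return [
        1 if cosine(q, fg_proto) > cosine(q, bg_proto) else 0
        for q in query_features
    ]
```

In practice the feature vectors would come from a frozen backbone layer; here they are plain Python lists so the sketch stays dependency-free.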

158. 【2602.07544】MUFASA: A Multi-Layer Framework for Slot Attention

Link: https://arxiv.org/abs/2602.07544

Authors: Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: decomposes visual scenes, decomposes visual, distinct entities, Unsupervised object-centric learning, visual scenes

Comments: Authors Sebastian Bock and Leonie Schüßler contributed equally. Project page: [this https URL](https://leonieschuessler.github.io/mufasa/)

Abstract:Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

159. 【2602.07540】LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

Link: https://arxiv.org/abs/2602.07540

Authors: Huimin Yan, Liang Bai, Xian Yang, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: paired data, CLIP-style medical vision, existing CLIP-style medical, substantial paired data, diagnostic evidence

Abstract: Most existing CLIP-style medical vision-language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.

160. 【2602.07535】Beyond Core and Penumbra: Bi-Temporal Image-Driven Stroke Evolution Analysis

Link: https://arxiv.org/abs/2602.07535

Authors: Md Sazidur Rahman, Kjersti Engan, Kathinka Dæhli Kurz, Mahdieh Khanmohammadi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Computed tomography perfusion, follow-up diffusion-weighted MRI, Computed tomography, diffusion-weighted MRI, definitive infarct outcome

Abstract:Computed tomography perfusion (CTP) at admission is routinely used to estimate the ischemic core and penumbra, while follow-up diffusion-weighted MRI (DWI) provides the definitive infarct outcome. However, single time-point segmentations fail to capture the biological heterogeneity and temporal evolution of stroke. We propose a bi-temporal analysis framework that characterizes ischemic tissue using statistical descriptors, radiomic texture features, and deep feature embeddings from two architectures (mJ-Net and nnU-Net). Bi-temporal refers to admission (T1) and post-treatment follow-up (T2). All features are extracted at T1 from CTP, with follow-up DWI aligned to ensure spatial correspondence. Manually delineated masks at T1 and T2 are intersected to construct six regions of interest (ROIs) encoding both initial tissue state and final outcome. Features were aggregated per region and analyzed in feature space. Evaluation on 18 patients with successful reperfusion demonstrated meaningful clustering of region-level representations. Regions classified as penumbra or healthy at T1 that ultimately recovered exhibited feature similarity to preserved brain tissue, whereas infarct-bound regions formed distinct groupings. Both baseline GLCM and deep embeddings showed a similar trend: penumbra regions exhibit features that are significantly different depending on final state, whereas this difference is not significant for core regions. Deep feature spaces, particularly mJ-Net, showed strong separation between salvageable and non-salvageable tissue, with a penumbra separation index that differed significantly from zero (Wilcoxon signed-rank test). These findings suggest that encoder-derived feature manifolds reflect underlying tissue phenotypes and state transitions, providing insight into imaging-based quantification of stroke evolution.
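The mask-intersection step can be pictured with the small sketch below. The abstract does not spell out the six ROIs, so the pairing used here (three admission states times two follow-up outcomes) is an assumption, as are the state names, the flat per-voxel lists, and the function names.

```python
# Assumed admission (T1) tissue states; 3 states x 2 outcomes = 6 regions.
T1_STATES = ("core", "penumbra", "healthy")


def label_regions(t1_state, t2_infarct):
    """Pair each voxel's T1 state with its T2 outcome.

    t1_state:   per-voxel state names drawn from T1_STATES
    t2_infarct: per-voxel truthy flag, set if inside the follow-up lesion
    """
    if len(t1_state) != len(t2_infarct):
        raise ValueError("masks must cover the same voxels")
    labels = []
    for state, infarcted in zip(t1_state, t2_infarct):
        if state not in T1_STATES:
            raise ValueError(f"unknown T1 state: {state}")
        outcome = "infarct" if infarcted else "spared"
        labels.append(f"{state}->{outcome}")
    return labels


def region_means(values, labels):
    """Aggregate a per-voxel feature into one mean value per region."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

Region-level vectors built this way (one entry per feature) are what would then be compared or clustered in feature space.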

161. 【2602.07534】Fine-Grained Cat Breed Recognition with Global Context Vision Transformer

链接https://arxiv.org/abs/2602.07534

作者:Mowmita Parvin Hera,Md. Shahriar Mahmud Kallol,Shohanur Rahman Nirob,Md. Badsha Bulbul,Jubayer Ahmed,M. Zhourul Islam,Hazrat Ali,Mohammmad Farhad Bulbul

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:Accurate identification, challenging task due, Oxford-IIIT Pet Dataset, facial structure, Context Vision Transformer

备注: 4 pages, accepted at International Conference on Computer and Information Technology (ICCIT) 2025

点击查看摘要

Abstract:Accurate identification of cat breeds from images is a challenging task due to subtle differences in fur patterns, facial structure, and color. In this paper, we present a deep learning-based approach for classifying cat breeds using a subset of the Oxford-IIIT Pet Dataset, which contains high-resolution images of various domestic breeds. We employed the tiny variant of the Global Context Vision Transformer (GCViT-Tiny) for cat breed recognition. To improve model generalization, we used extensive data augmentation, including rotation, horizontal flipping, and brightness adjustment. Experimental results show that the GCViT-Tiny model achieved a test accuracy of 92.00% and a validation accuracy of 94.54%. These findings highlight the effectiveness of transformer-based architectures for fine-grained image classification tasks. Potential applications include veterinary diagnostics, animal shelter management, and mobile-based breed recognition systems. We also provide a Hugging Face demo at this https URL.

162. 【2602.07532】Evaluating Object-Centric Models beyond Object Discovery

链接https://arxiv.org/abs/2602.07532

作者:Krishnakant Singh,Simone Schaub-Meyer,Stefan Roth

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:learn structured scene, support compositional generalization, Object-centric learning, structured scene representations, OCL models

备注: Project Page: [this https URL](https://guided-sa.github.io/eval-ocl/)

点击查看摘要

Abstract:Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.

163. 【2602.07523】CA-YOLO: Cross Attention Empowered YOLO for Biomimetic Localization

链接https://arxiv.org/abs/2602.07523

作者:Zhen Zhang,Qing Zhao,Xiuhe Li,Cheng Wang,Guoqiang Zhu,Yu Zhang,Yining Huo,Hongyi Yu,Yi Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:modern complex environments, efficient target localization, complex environments, achieving accurate, numerous fields

备注: This work has been submitted to the IEEE for possible publication. Please note that once the article has been published by IEEE, preprints on locations not specified above should be removed if possible

点击查看摘要

Abstract:In modern complex environments, achieving accurate and efficient target localization is essential in numerous fields. However, existing systems often face limitations in both accuracy and the ability to recognize small targets. In this study, we propose a bionic stabilized localization system based on CA-YOLO, designed to enhance both target localization accuracy and small target recognition capabilities. Acting as the "brain" of the system, the target detection algorithm emulates the visual focusing mechanism of animals by integrating bionic modules into the YOLO backbone network. These modules include a small target detection head and a Characteristic Fusion Attention Mechanism (CFAM). Furthermore, drawing inspiration from the human Vestibulo-Ocular Reflex (VOR), a bionic pan-tilt tracking control strategy is developed, which incorporates central positioning, stability optimization, adaptive control coefficient adjustment, and an intelligent recapture function. The experimental results show that CA-YOLO outperforms the original model on standard datasets (COCO and VisDrone), with average accuracy metrics improved by 3.94% and 4.90%, respectively. Time-sensitive target localization experiments validate the effectiveness and practicality of this bionic stabilized localization system.

164. 【2602.07512】Adaptive Image Zoom-in with Bounding Box Transformation for UAV Object Detection

链接https://arxiv.org/abs/2602.07512

作者:Tao Wang,Chenyu Lin,Chenwei Tang,Jizhe Zhou,Deng Xiong,Jianan Li,Jian Zhao,Jiancheng Lv

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:small object size, Detecting objects, object detection, challenging due, object

备注: paper accepted by ISPRS Journal of Photogrammetry and Remote Sensing (IF=12.2)

点击查看摘要

Abstract:Detecting objects from UAV-captured images is challenging due to the small object size. In this work, a simple and efficient adaptive zoom-in framework is explored for object detection on UAV images. The main motivation is that the foreground objects are generally smaller and sparser than those in common scene images, which hinders the optimization of effective object detectors. We thus aim to zoom in adaptively on the objects to better capture object features for the detection task. To achieve this, two questions must be answered: (i) how to conduct non-uniform zooming on each image efficiently, and (ii) how to enable object detection training and inference in the zoomed image space. Correspondingly, a lightweight offset prediction scheme coupled with a novel box-based zooming objective is introduced to learn non-uniform zooming on the input image. Based on the learned zooming transformation, a corner-aligned bounding box transformation method is proposed. The method warps the ground-truth bounding boxes to the zoomed space to learn object detection, and warps the predicted bounding boxes back to the original space during inference. We conduct extensive experiments on three representative UAV object detection datasets: VisDrone, UAVDT, and SeaDronesSee. The proposed ZoomDet is architecture-independent and can be applied to an arbitrary object detection architecture. Remarkably, on the SeaDronesSee dataset, ZoomDet offers an absolute mAP gain of more than 8.4 with a Faster R-CNN model, with only about 3 ms additional latency. The code is available at this https URL.
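The idea of warping boxes into a zoomed space and back can be sketched with a toy transformation. This is not the learned offset-based zoom from the paper: it assumes a fixed, separable, monotone piecewise-linear zoom per axis, and `AxisZoom` and `warp_box` are hypothetical names.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation through knots (xs, ys); xs strictly increasing."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


class AxisZoom:
    """Monotone 1D zoom given by knots in original (src) and zoomed (dst) coordinates."""

    def __init__(self, src, dst):
        self.src, self.dst = list(src), list(dst)

    def forward(self, x):
        return interp(x, self.src, self.dst)

    def inverse(self, x):
        return interp(x, self.dst, self.src)


def warp_box(box, zoom_x, zoom_y, inverse=False):
    """Warp an (x1, y1, x2, y2) box by mapping its two corners."""
    fx = zoom_x.inverse if inverse else zoom_x.forward
    fy = zoom_y.inverse if inverse else zoom_y.forward
    x1, y1, x2, y2 = box
    return (fx(x1), fy(y1), fx(x2), fy(y2))


# Magnify the central band [40, 60] of a 100-pixel axis to [20, 80].
zoom = AxisZoom([0, 40, 60, 100], [0, 20, 80, 100])
gt_box = (45.0, 50.0, 55.0, 60.0)
zoomed = warp_box(gt_box, zoom, zoom)                    # used as the training target
restored = warp_box(zoomed, zoom, zoom, inverse=True)    # applied to predictions
```

Because the toy zoom is separable and monotone, an axis-aligned box stays axis-aligned, so mapping the two corners is exact and the inverse recovers the original box. A non-separable learned warp would need more care.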

165. 【2602.07498】IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation

链接https://arxiv.org/abs/2602.07498

作者:Zhufeng Xu,Xuan Gao,Feng-Lin Liu,Haoxian Zhang,Zhixue Fang,Yu-Kun Lai,Xiaoqiang Liu,Pengfei Wan,Lin Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advanced character animation, markedly advanced character, Recent progress, synthesizes motioned videos, video diffusion models

备注

点击查看摘要

Abstract:Recent progress in video diffusion models has markedly advanced character animation, which synthesizes animated videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeletons, DWPose, or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address the above challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes strict spatial constraints inherent in 2D representations and effectively prevents identity information leakage from the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source images' motion and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive performance compared with state-of-the-art methods.

166. 【2602.07495】Learning Brain Representation with Hierarchical Visual Embeddings

链接https://arxiv.org/abs/2602.07495

作者:Jiawen Zheng,Haonan Jia,Ming Li,Yuhui Zheng,Yufeng Zeng,Yang Gao,Chen Liang

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:attracted significant attention, artificial intelligence, attracted significant, significant attention, neuroscience and artificial

备注

点击查看摘要

Abstract:Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.

167. 【2602.07493】Thermal odometry and dense mapping using learned odometry and Gaussian splatting

链接https://arxiv.org/abs/2602.07493

作者:Tianhao Zhou,Yujia Chen,Zhihao Zhan,Yuhang Ming,Jianzhu Huai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:capture imagery independent, Thermal infrared sensors, smoke particles, infrared sensors, independent of darkness

备注: 11 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Thermal infrared sensors, with wavelengths longer than smoke particles, can capture imagery independent of darkness, dust, and smoke. This robustness has made them increasingly valuable for motion estimation and environmental perception in robotics, particularly in adverse conditions. Existing thermal odometry and mapping approaches, however, are predominantly geometric and often fail across diverse datasets while lacking the ability to produce dense maps. Motivated by the efficiency and high-quality reconstruction ability of recent Gaussian Splatting (GS) techniques, we propose TOM-GS, a thermal odometry and mapping method that integrates learning-based odometry with GS-based dense mapping. TOM-GS is among the first GS-based SLAM systems tailored for thermal cameras, featuring dedicated thermal image enhancement and monocular depth integration. Extensive experiments on motion estimation and novel-view rendering demonstrate that TOM-GS outperforms existing learning-based methods, confirming the benefits of learning-based pipelines for robust thermal odometry and dense reconstruction.

168. 【2602.07463】GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring

链接https://arxiv.org/abs/2602.07463

作者:Misbah Ijaz,Saif Ur Rehman Khan,Abd Ur Rehman,Tayyaba Asif,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requires efficient sorting, efficient sorting techniques, growing amount, requires efficient, efficient sorting

备注

点击查看摘要

Abstract:The growing amount of waste is a problem for the environment that requires efficient sorting techniques for various kinds of waste. Automated waste classification systems are used for this purpose. The effectiveness of these Artificial Intelligence (AI) models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and analyzing classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile this novel integrated GWD archive by merging multiple publicly available datasets into a single, unified resource. This GWD archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps such as quality filtering, duplicate removal, and metadata generation further improve dataset reliability. Overall, this dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and is publicly available to promote future research and reproducibility.

169. 【2602.07458】SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

链接https://arxiv.org/abs/2602.07458

作者:Yancheng Long,Yankai Yang,Hongyang Wei,Wei Chen,Tianke Zhang,Haonan Fan,Changyi Liu,Kaiyu Jiang,Jiankang Chen,Kaiyu Tang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Shuo Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Online Reinforcement Learning, Reinforcement Learning, offers a promising, Online Reinforcement, promising avenue

备注

点击查看摘要

Abstract:Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

170. 【2602.07449】SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

链接https://arxiv.org/abs/2602.07449

作者:Tan Yu,Qian Qiao,Le Shen,Ke Zhou,Jincheng Hu,Dian Sheng,Bo Hu,Haoming Qin,Jun Gao,Changhai Zhou,Shunshun Yin,Siyuan Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Achieving a balance, audio-driven portrait generation, low-latency streaming remains, quality and low-latency, remains a formidable

备注: 11 pages, 3 figures

点击查看摘要

Abstract:Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.

171. 【2602.07446】PTB-XL-Image-17K: A Large-Scale Synthetic ECG Image Dataset with Comprehensive Ground Truth for Deep Learning-Based Digitization

链接https://arxiv.org/abs/2602.07446

作者:Naqcho Ali Mehdi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep learning applications, modern deep learning, scanned ECG images, ECG images back, ECG images

备注: 8 pages, 4 figures, dataset paper

点击查看摘要

Abstract:Electrocardiogram (ECG) digitization, i.e., converting paper-based or scanned ECG images back into time-series signals, is critical for leveraging decades of legacy clinical data in modern deep learning applications. However, progress has been hindered by the lack of large-scale datasets providing both ECG images and their corresponding ground truth signals with comprehensive annotations. We introduce PTB-XL-Image-17K, a complete synthetic ECG image dataset comprising 17,271 high-quality 12-lead ECG images generated from the PTB-XL signal database. Our dataset uniquely provides five complementary data types per sample: (1) realistic ECG images with authentic grid patterns and annotations (50% with visible grid, 50% without), (2) pixel-level segmentation masks, (3) ground truth time-series signals, (4) bounding box annotations in YOLO format for both lead regions and lead name labels, and (5) comprehensive metadata including visual parameters and patient information. We present an open-source Python framework enabling customizable dataset generation with controllable parameters including paper speed (25/50 mm/s), voltage scale (5/10 mm/mV), sampling rate (500 Hz), grid appearance (4 colors), and waveform characteristics. The dataset achieves a 100% generation success rate with an average processing time of 1.35 seconds per sample. PTB-XL-Image-17K addresses critical gaps in ECG digitization research by providing the first large-scale resource supporting the complete pipeline: lead detection, waveform segmentation, and signal extraction with full ground truth for rigorous evaluation. The dataset, generation framework, and documentation are publicly available at this https URL and this https URL.
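Two of the conventions mentioned above lend themselves to a short sketch: normalized YOLO boxes, and mapping a signal sample onto ECG paper using the paper speed and voltage scale. This is not the dataset's generation code. The function names are hypothetical, and `px_per_mm` and `baseline_px` are assumed rendering parameters.

```python
def to_yolo(box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to YOLO (cx, cy, w, h) in [0, 1]."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)


def sample_to_pixel(index, millivolts, *, fs=500, speed_mm_s=25, gain_mm_mv=10,
                    px_per_mm=8.0, baseline_px=200.0):
    """Map sample `index` with amplitude `millivolts` to image coordinates.

    px_per_mm and baseline_px are hypothetical rendering parameters.
    """
    x = index / fs * speed_mm_s * px_per_mm                  # seconds -> mm -> pixels
    y = baseline_px - millivolts * gain_mm_mv * px_per_mm    # image y grows downward
    return x, y


yolo_box = to_yolo((100, 50, 300, 150), img_w=1000, img_h=500)
point = sample_to_pixel(500, 1.0)   # one second into the trace, 1 mV above baseline
```

In a YOLO label file each line usually carries a class index followed by these four normalized values.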

172. 【2602.07444】Perspective-aware fusion of incomplete depth maps and surface normals for accurate 3D reconstruction

链接https://arxiv.org/abs/2602.07444

作者:Ondrej Hlinka,Georg Kaniak,Christian Kapeller

类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:sensor system based, single perspective camera, normal maps acquired, problem of reconstructing, address the problem

备注: submitted to IET Electronics Letters

点击查看摘要

Abstract:We address the problem of reconstructing 3D surfaces from depth and surface normal maps acquired by a sensor system based on a single perspective camera. Depth and normal maps can be obtained through techniques such as structured-light scanning and photometric stereo, respectively. We propose a perspective-aware log-depth fusion approach that extends existing orthographic gradient-based depth-normals fusion methods by explicitly accounting for perspective projection, leading to metrically accurate 3D reconstructions. Additionally, the method handles missing depth measurements by leveraging available surface normal information to inpaint gaps. Experiments on the DiLiGenT-MV data set demonstrate the effectiveness of our approach and highlight the importance of perspective-aware depth-normals fusion.
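Why log-depth is a natural variable under perspective projection can be seen from a short derivation. It assumes a pinhole camera with focal length f in pixels and pixel coordinates (x, y) measured from the principal point; the fusion energy at the end is a generic least-squares form given for illustration, not necessarily the authors' formulation.

```latex
% Back-projected surface point with depth z(x, y) and unit normal n = (n_x, n_y, n_z).
\begin{align}
  \mathbf{X}(x, y) &= z(x, y)\,\bigl(\tfrac{x}{f},\ \tfrac{y}{f},\ 1\bigr)^{\top}, \\
  0 = \mathbf{n}^{\top}\,\partial_x \mathbf{X}
    &= \partial_x z\,\frac{n_x x + n_y y + n_z f}{f} + \frac{z\,n_x}{f}, \\
  \partial_x \log z &= -\frac{n_x}{n_x x + n_y y + n_z f}, \qquad
  \partial_y \log z = -\frac{n_y}{n_x x + n_y y + n_z f}.
\end{align}
% Generic fusion energy over \zeta = \log z, with g the gradient field above,
% z_0 the measured depth, and w(p) = 0 at pixels where depth is missing.
\begin{equation}
  E(\zeta) = \sum_{p} \bigl\lVert \nabla \zeta(p) - \mathbf{g}(p) \bigr\rVert^2
           + \lambda \sum_{p} w(p)\,\bigl(\zeta(p) - \log z_0(p)\bigr)^2 .
\end{equation}
```

The normals fix the gradient of log z without reference to z itself, so the problem stays linear in the unknown, and pixels with w(p) = 0 are filled in from the normal term alone.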

173. 【2602.07428】Row-Column Separated Attention Based Low-Light Image/Video Enhancement

链接https://arxiv.org/abs/2602.07428

作者:Chengqi Dong,Zhiyuan Cao,Tuoshi Qi,Kexin Wu,Yixing Gao,Fan Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:structure is widely, global information, information, U-Net structure, Separated Attention module

备注

点击查看摘要

Abstract:The U-Net structure is widely used for low-light image/video enhancement. Without proper guidance from global information, the enhanced images contain areas with large local noise and lose more detail. Attention mechanisms can better focus on and use global information. However, applying attention to images can significantly increase the number of parameters and computations. We propose a Row-Column Separated Attention module (RCSA) inserted after an improved U-Net. The RCSA module takes as input the mean and maximum of each row and column of the feature map, and uses this global information to guide local information with fewer parameters. We propose two temporal loss functions to apply the method to low-light video enhancement and maintain temporal consistency. Extensive experiments on the LOL, MIT Adobe FiveK image, and SDSD video datasets demonstrate the effectiveness of our approach. The code is publicly available at this https URL.
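The row and column statistics that feed the module can be sketched for a single-channel feature map. How RCSA consumes these descriptors is not described in the abstract, so only the pooling step is shown; `row_col_stats` is a hypothetical helper.

```python
def row_col_stats(feature_map):
    """Return per-row and per-column mean and max of a 2D feature map (list of lists)."""
    rows = feature_map
    cols = list(zip(*feature_map))   # transpose to iterate over columns
    return {
        "row_mean": [sum(r) / len(r) for r in rows],
        "row_max": [max(r) for r in rows],
        "col_mean": [sum(c) / len(c) for c in cols],
        "col_max": [max(c) for c in cols],
    }


fmap = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 9.0]]
stats = row_col_stats(fmap)
```

For an H x W map this yields 2H + 2W values instead of H * W positions, which is consistent with the abstract's claim of fewer parameters and computations.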

174. 【2602.07399】VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

链接https://arxiv.org/abs/2602.07399

作者:Changhua Xu,Jie Lu,Junyu Xuan,En Yu

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:models bridge multimodal, bridge multimodal reasoning, demonstrations remains unreliable, models bridge

备注: Preprint

点击查看摘要

Abstract:Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a fine-tuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic, to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at this https URL.
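Inference-time best-of-N selection can be sketched generically: a proposal generator samples N candidate action chunks and a critic picks the highest-scoring one. The policy and critic below are toy stand-ins, not the fine-tuned VLA or the Q-Chunk-Former, and all names are hypothetical.

```python
import random


def best_of_n(propose, critic, observation, n=8, seed=0):
    """Sample n candidate action chunks and return the one the critic scores highest."""
    rng = random.Random(seed)
    candidates = [propose(observation, rng) for _ in range(n)]
    scores = [critic(observation, chunk) for chunk in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins: the "policy" proposes noisy 1D reach targets, and the "critic"
# prefers chunks whose final action lands closest to the goal.
def toy_policy(goal, rng):
    return [goal + rng.gauss(0.0, 0.5) for _ in range(4)]


def toy_critic(goal, chunk):
    return -abs(chunk[-1] - goal)


chunk, score = best_of_n(toy_policy, toy_critic, observation=1.0, n=16)
```

Selection quality depends entirely on how well the critic ranks near-miss candidates, which is the part the abstract's regularization targets.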

175. 【2602.07345】Optimizing Few-Step Generation with Adaptive Matching Distillation

链接https://arxiv.org/abs/2602.07345

作者:Lichen Bai,Zikai Zhou,Shitong Shao,Wenliang Zhong,Shuo Yang,Shuo Chen,Bojun Chen,Zeke Xie

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Distribution Matching Distillation, powerful acceleration paradigm, fake teacher exerts, teacher exerts insufficient, Adaptive Matching Distillation

备注: 25 pages, 15 figures, 11 tables

点击查看摘要

Abstract:Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zones, i.e., regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

176. 【2602.07343】Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

链接https://arxiv.org/abs/2602.07343

作者:Ruturaj Reddy,Hrishav Bakul Barua,Junn Yong Loo,Thanh Thi Nguyen,Ganesh Krishnasamy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:autonomous driving applications, shadow conditions remain, Robust semantic segmentation, driving applications, Robust semantic

备注

点击查看摘要

Abstract:Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

177. 【2602.07311】LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery

链接https://arxiv.org/abs/2602.07311

作者:Difei Gu,Yunhe Gao,Gerasimos Chatzoudis,Zihan Dong,Guoning Zhang,Bangwei Guo,Yang Zhou,Mu Zhou,Dimitris Metaxas

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Unified vision-language sparse, offer a natural, natural path, path toward comparable, Learning Unified vision-language

备注

点击查看摘要

Abstract:Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective without the need of labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.

178. 【2602.07310】Optimization of Precipitate Segmentation Through Linear Genetic Programming of Image Processing

链接https://arxiv.org/abs/2602.07310

作者:Kyle Williams,Andrew Seltzman

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:hand annotation due, Current analysis, slowing iteration speed, FIB cross-section micrographs, varying contrast

备注: 39 pages, 12 figures, 1 table

点击查看摘要

Abstract:Current analysis of additively manufactured niobium-based copper alloys relies on hand annotation due to varying contrast, noise, and image artifacts present in micrographs, slowing iteration speed in alloy development. We present a filtering and segmentation algorithm for detecting precipitates in FIB cross-section micrographs, optimized using linear genetic programming (LGP), which accounts for the various artifacts. To this end, the optimization environment uses a domain-specific language for image processing to iterate on solutions. Programs in this language are a list of image-filtering blocks with tunable parameters that sequentially process an input image, allowing for reliable generation and mutation by a genetic algorithm. Our environment produces optimized human-interpretable MATLAB code representing an image filtering pipeline. Under ideal conditions (a population size of 60 and a maximum program length of 5 blocks), our system was able to find a near-human accuracy solution with an average evaluation error of 1.8% when comparing segmentations pixel-by-pixel to a human baseline using an XOR error evaluation. Our automation work enabled faster iteration cycles and furthered exploration of the material composition and processing space: our optimized pipeline algorithm processes a 3.6 megapixel image in about 2 seconds on average. This ultimately enables convergence on strong, low-activation, precipitation-hardened copper alloys for additively manufactured fusion reactor parts.
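The program representation and the XOR error can be sketched as follows. The authors' environment emits MATLAB code. This Python version is only an illustration, and the `threshold` and `invert` blocks are hypothetical examples of filtering blocks.

```python
def xor_error(pred, truth):
    """Fraction of pixels where two binary masks disagree (pixel-wise XOR)."""
    flat_p = [p for row in pred for p in row]
    flat_t = [t for row in truth for t in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("masks must have the same shape")
    return sum(bool(p) != bool(t) for p, t in zip(flat_p, flat_t)) / len(flat_p)


def run_program(image, program):
    """Apply a list of (block, params) pairs to the image in order."""
    for block, params in program:
        image = block(image, **params)
    return image


def threshold(image, level):
    return [[1 if px >= level else 0 for px in row] for row in image]


def invert(image):
    return [[1 - px for px in row] for row in image]


image = [[10, 200], [90, 160]]
human = [[1, 0], [1, 1]]   # hand-annotated reference mask
program = [(threshold, {"level": 128}), (invert, {})]
error = xor_error(run_program(image, program), human)   # fitness to minimize
```

A genetic algorithm would mutate the block list and its parameters (for example the threshold level) and keep the programs with the lowest error.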

179. 【2602.07301】Diabetic Retinopathy Lesion Segmentation through Attention Mechanisms

Link: https://arxiv.org/abs/2602.07301

Authors: Aruna Jithesh, Chinmayi Karumuri, Venkata Kiran Reddy Kotha, Meghana Doddapuneni, Taehee Jeong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diabetic Retinopathy, diabetes mellitus, eye disease, disease which arises, arises due

Abstract:Diabetic Retinopathy (DR) is an eye disease that arises from diabetes mellitus and can cause vision loss and blindness. To prevent irreversible vision loss, early detection through systematic screening is crucial. Although researchers have developed numerous automated deep learning-based algorithms for DR screening, their clinical applicability remains limited, particularly in lesion segmentation. Our method provides pixel-level annotations for lesions, which helps ophthalmologists screen for DR from fundus images. In this work, we segmented four types of DR-related lesions (microaneurysms, soft exudates, hard exudates, and hemorrhages) on 757 images from the DDR dataset. To enhance lesion segmentation, an attention mechanism was integrated with DeepLab-V3+. Compared to the baseline model, the Attention-DeepLab model increases mean average precision (mAP) from 0.3010 to 0.3326 and mean Intersection over Union (IoU) from 0.1791 to 0.1928. The model also increased microaneurysm detection from 0.0205 to 0.0763, a clinically significant improvement, since microaneurysms are the earliest visible symptom of DR.
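For readers unfamiliar with the IoU metric reported above, here is a minimal sketch of how Intersection over Union is commonly computed for one binary lesion mask. The function name `iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def iou(pred, truth):
    """Intersection over Union of two binary masks (lists of rows of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += bool(p) and bool(t)  # both masks mark the pixel
            union += bool(p) or bool(t)          # at least one mask marks it
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy 2x3 masks for a single lesion class.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = iou(pred, truth)  # 2 shared pixels out of 4 marked in either mask
```

A mean IoU is then typically the average of this score over lesion classes, which is why small, sparse lesions such as microaneurysms can pull the mean down sharply.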

180. 【2602.07277】Cross-View World Models

Link: https://arxiv.org/abs/2602.07277

Authors: Rishabh Sharma, Gijs Hogervorst, Wayne E. Mackey, David J. Heeger, Stefano Martiniani

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: existing approaches operate, World models enable, imagining future states, make planning easier, Cross-View World Models

Comments: 12 pages, 7 figures

Abstract:World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view-invariant representations of the environment's 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.

181. 【2602.07272】VideoNeuMat: Neural Material Extraction from Generative Video Models

Link: https://arxiv.org/abs/2602.07272

Authors: Bowen Xue, Saeed Hadadan, Zheng Zeng, Fabrice Rousselle, Zahra Montazeri, Milos Hasan

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: rendering requires exceptional, exceptional artistic skill, requires exceptional artistic, rendering requires, artistic skill

Abstract:Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a "virtual gonioreflectometer" that preserves the model's material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.

182. 【2602.07262】TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition

Link: https://arxiv.org/abs/2602.07262

Authors: Junbo Jacob Lian, Feng Xiong, Yujun Sun, Kaichen Ouyang, Mingyang Yu, Shengwei Fu, Zhong Rui, Zhang Yujun, Huiling Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gram matrices capture, Second-order feature statistics, current methods face, matrices capture global, collapse spatial structure

Comments: Code is available at [this https URL](https://github.com/junbolian/TwistNet-2D)

Abstract:Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes local pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet-2D incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines (including ConvNeXt, Swin Transformer, and hybrid CNN-Transformer architectures) across four texture and fine-grained recognition benchmarks.
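The "shift, then multiply channels" idea behind STCI can be sketched in a few lines. This is only an illustration of the general operation described in the abstract, not the released implementation: the function names, the zero padding at the border, and the single `(dy, dx)` displacement are assumptions, and the learned reweighting, four directional heads, and gated residual path are omitted.

```python
def shift(channel, dy, dx):
    """Shift a 2D map by (dy, dx), filling vacated cells with zeros (assumed padding)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < h and 0 <= src_x < w:
                out[y][x] = channel[src_y][src_x]
    return out


def shifted_channel_products(features, dy, dx):
    """Element-wise product of every channel i with the shifted copy of channel j."""
    shifted = [shift(c, dy, dx) for c in features]
    h, w = len(features[0]), len(features[0][0])
    n = len(features)
    return {
        (i, j): [[features[i][y][x] * shifted[j][y][x] for x in range(w)]
                 for y in range(h)]
        for i in range(n)
        for j in range(n)
    }


# Two toy 2x2 channels; displace the second operand one pixel to the right.
features = [[[1, 2], [3, 4]],
            [[5, 6], [7, 8]]]
products = shifted_channel_products(features, dy=0, dx=1)
```

Unlike a Gram matrix, each product here is still a spatial map, so the result records where two channels co-occur at a given offset as well as how strongly.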

183. 【2602.07260】3D Transport-based Morphometry (3D-TBM) for medical image analysis

Link: https://arxiv.org/abs/2602.07260

Authors: Hongyu Kan, Kristofor Pas, Ivan Medri, Naqib Sad Pathan, Natasha Ironside, Shinjini Kundu, Jingjia He, Gustavo Kunde Rohde

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Transport-Based Morphometry, Morphometry, TBM, TBM facilitates effective, Abstract

Abstract:Transport-Based Morphometry (TBM) has emerged as a new framework for 3D medical image analysis. By embedding images into a transport domain via invertible transformations, TBM facilitates effective classification, regression, and other tasks using transport-domain features. Crucially, the inverse mapping enables the projection of analytic results back into the original image space, allowing researchers to directly interpret clinical features associated with model outputs in a spatially meaningful way. To facilitate broader adoption of TBM in clinical imaging research, we present 3D-TBM, a tool designed for morphological analysis of 3D medical images. The framework includes data preprocessing, computation of optimal transport embeddings, and analytical methods such as visualization of main transport directions, together with techniques for discerning discriminating directions and related analysis methods. We also provide comprehensive documentation and practical tutorials to support researchers interested in applying 3D-TBM in their own medical imaging studies. The source code is publicly available through PyTransKit.

184. 【2602.07251】The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models

Link: https://arxiv.org/abs/2602.07251

Authors: Haley Duba-Sullivan, Steven R. Young, Emma J. Reid

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Data-driven super-resolution, improve downstream tasks, imaging pipelines, classification and detection, preprocessing steps

Abstract:Data-driven super-resolution (SR) methods are often integrated into imaging pipelines as preprocessing steps to improve downstream tasks such as classification and detection. However, these SR models introduce a previously unexplored attack surface into imaging pipelines. In this paper, we present AdvSR, a framework demonstrating that adversarial behavior can be embedded directly into SR model weights during training, requiring no access to inputs at inference time. Unlike prior attacks that perturb inputs or rely on backdoor triggers, AdvSR operates entirely at the model level. By jointly optimizing for reconstruction quality and targeted adversarial outcomes, AdvSR produces models that appear benign under standard image quality metrics while inducing downstream misclassification. We evaluate AdvSR on three SR architectures (SRCNN, EDSR, SwinIR) paired with a YOLOv11 classifier and demonstrate that AdvSR models can achieve high attack success rates with minimal quality degradation. These findings highlight a new model-level threat for imaging pipelines, with implications for how practitioners source and validate models in safety-critical applications.

185. 【2602.07212】Understanding Real-World Traffic Safety through RoadSafe365 Benchmark

Link: https://arxiv.org/abs/2602.07212

Authors: Xinyu Liu, Darryl C. Jacob, Yuxin Liu, Xinsong Du, Muchao Ye, Bolei Zhou, Pan He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generally lack systematic, advanced multimodal data, multimodal data analysis, lack systematic evaluation, systematic evaluation aligned

Abstract:Although recent traffic benchmarks have advanced multimodal data analysis, they generally lack systematic evaluation aligned with official safety standards. To fill this gap, we introduce RoadSafe365, a large-scale vision-language benchmark that supports fine-grained analysis of traffic safety from extensive and diverse real-world video data collections. Unlike prior works that focus primarily on coarse accident identification, RoadSafe365 is independently curated and systematically organized using a hierarchical taxonomy that refines and extends foundational definitions of crash, incident, and violation to bridge official traffic safety standards with data-driven traffic understanding systems. RoadSafe365 provides rich attribute annotations across diverse traffic event types, environmental contexts, and interaction scenarios, yielding 36,196 annotated clips from both dashcam and surveillance cameras. Each clip is paired with multiple-choice question-answer sets, comprising 864K candidate options, 8.4K unique answers, and 36K detailed scene descriptions collectively designed for vision-language understanding and reasoning. We establish strong baselines and observe consistent gains when fine-tuning on RoadSafe365. Cross-domain experiments on both real and synthetic datasets further validate its effectiveness. Designed for large-scale training and standardized evaluation, RoadSafe365 provides a comprehensive benchmark to advance reproducible research in real-world traffic safety analysis.

186. 【2602.07198】Condition Matters in Full-head 3D GANs

Link: https://arxiv.org/abs/2602.07198

Authors: Heyuan Li, Huimin Zhang, Yuda Qiu, Zhengwentai Sun, Keru Zheng, Lingteng Qiu, Peihao Li, Qi Zuo, Ce Chen, Yujian Zheng, Yuming Gu, Zilong Dong, Xiaoguang Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: crucial for stable, Conditioning, semantic, conditioning input, semantic condition

Comments: Accepted by ICLR 2026. Project page: [this https URL](https://lhyfst.github.io/balancehead/)

Abstract:Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making training impractical. However, previous full-head 3D GANs conventionally choose the view angle as the conditioning input, which leads to a bias in the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use a view-invariant semantic feature as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The CLIP image feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.

187. 【2602.07174】DuMeta++: Spatiotemporal Dual Meta-Learning for Generalizable Few-Shot Brain Tissue Segmentation Across Diverse Ages

Link: https://arxiv.org/abs/2602.07174

Authors: Yongheng Sun, Jun Shu, Jianhua Ma, Fan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieving consistent performance, human lifespan remains, lifespan remains challenging, remains challenging due, tissues from MRI

Abstract:Accurate segmentation of brain tissues from MRI scans is critical for neuroscience and clinical applications, but achieving consistent performance across the human lifespan remains challenging due to dynamic, age-related changes in brain appearance and morphology. While prior work has sought to mitigate these shifts by using self-supervised regularization with paired longitudinal data, such data are often unavailable in practice. To address this, we propose DuMeta++, a dual meta-learning framework that operates without paired longitudinal data. Our approach integrates: (1) meta-feature learning to extract age-agnostic semantic representations of spatiotemporally evolving brain structures, and (2) meta-initialization learning to enable data-efficient adaptation of the segmentation model. Furthermore, we propose a memory-bank-based class-aware regularization strategy to enforce longitudinal consistency without explicit longitudinal supervision. We theoretically prove the convergence of DuMeta++, ensuring stability. Experiments on diverse datasets (iSeg-2019, IBIS, OASIS, ADNI) under few-shot settings demonstrate that DuMeta++ outperforms existing methods in cross-age generalization. Code will be available at this https URL.

188. 【2602.07156】Mimetic Initialization of MLPs

Link: https://arxiv.org/abs/2602.07156

Authors: Asher Trockman, J. Zico Kolter

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mimetic initialization, pretrained models, models as case, case studies, studies of good

Abstract:Mimetic initialization uses pretrained models as case studies of good initialization, using observations of structures in trained weights to inspire new, simple initialization techniques. So far, it has been applied only to spatial mixing layers, such as convolutional, self-attention, and state space layers. In this work, we present the first attempt to apply the method to channel mixing layers, namely multilayer perceptrons (MLPs). Our extremely simple technique for MLPs, giving the first layer a nonzero mean, speeds up training on small-scale vision tasks like CIFAR-10 and ImageNet-1k. Though its effect is much smaller than that of spatial mixing initializations, it can be used in conjunction with them for an additional positive effect.
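One plausible reading of "giving the first layer a nonzero mean" is sketched below. It assumes the technique shifts the mean of the first MLP layer's weight distribution; the function name `init_first_layer`, the Gaussian form, the `1/sqrt(fan_in)` scale, and the mean value of 0.1 are all hypothetical choices for illustration, not values taken from the paper.

```python
import random


def init_first_layer(fan_in, fan_out, mean=0.1, seed=0):
    """Sample a fan_out x fan_in weight matrix from a Gaussian with a nonzero mean.

    The scale 1/sqrt(fan_in) and the default mean are assumptions, not the
    paper's settings.
    """
    rng = random.Random(seed)
    std = (1.0 / fan_in) ** 0.5
    return [[rng.gauss(mean, std) for _ in range(fan_in)] for _ in range(fan_out)]


weights = init_first_layer(fan_in=64, fan_out=256)
# The empirical mean should sit near the requested offset rather than near zero.
empirical_mean = sum(sum(row) for row in weights) / (64 * 256)
```

Setting `mean=0.0` in this sketch recovers an ordinary zero-centered Gaussian initialization, which makes it easy to compare the two in a training run.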

189. 【2602.07149】Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds

Link: https://arxiv.org/abs/2602.07149

Authors: Rawisara Lohanimit, Yankun Wu, Amelia Katirai, Yuta Nakashima, Noa Garcia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale datasets collected, rise of generative, generative models, models has led, led to increased

Abstract:The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasounds and detect thousands of entities of private information such as names and locations. Our findings reveal that multiple images have high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
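Retrieval by embedding similarity, as described above, usually comes down to ranking items by cosine similarity to a query embedding. The sketch below uses tiny hand-written vectors as stand-ins for CLIP embeddings; the function names, the file names, and the 0.5 threshold are assumptions for illustration and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, threshold):
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, cosine(query, emb)) for name, emb in corpus.items()]
    hits = [item for item in scored if item[1] >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Stand-in 2-D "embeddings"; real CLIP vectors have hundreds of dimensions.
query = [1.0, 0.0]
corpus = {
    "img_a.jpg": [0.9, 0.1],
    "img_b.jpg": [0.0, 1.0],
    "img_c.jpg": [0.6, 0.8],
}
hits = retrieve(query, corpus, threshold=0.5)
```

In a real audit, the threshold trades recall against the amount of manual review needed, so it would have to be tuned on a labeled sample.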

190. 【2602.07125】Reasoning-Augmented Representations for Multimodal Retrieval

Link: https://arxiv.org/abs/2602.07125

Authors: Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Universal Multimodal Retrieval, models remain brittle, require latent reasoning, Universal Multimodal, queries require latent

Abstract:Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong vision-language model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at this https URL.

191. 【2602.07106】Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Link: https://arxiv.org/abs/2602.07106

Authors: Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, unify multimodal understanding, remains largely unexplored, Omni-modal large language, animation remains largely

Abstract:Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset intended to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.

192. 【2602.07104】Extended to Reality: Prompt Injection in 3D Environments

Link: https://arxiv.org/abs/2602.07104

Authors: Zhuoheng Li, Ying Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, large language models, situated conversational agents, Multimodal large, empowering diverse applications

Abstract:Multimodal large language models (MLLMs) have advanced the capabilities to interpret and act on visual input in 3D environments, empowering diverse applications such as robotics and situated conversational agents. When MLLMs reason over camera-captured views of the physical world, a new attack surface emerges: an attacker can place text-bearing physical objects in the environment to override MLLMs' intended task. While prior work has studied prompt injection in the text domain and through digitally edited 2D images, it remains unclear how these attacks function in 3D physical environments. To bridge the gap, we introduce PI3D, a prompt injection attack against MLLMs in 3D environments, realized through text-bearing physical object placement rather than digital image edits. We formulate and solve the problem of identifying an effective 3D object pose (position and orientation) with injected text, where the attacker's goal is to induce the MLLM to perform the injected task while ensuring that the object placement remains physically plausible. Experiments demonstrate that PI3D is an effective attack against multiple MLLMs under diverse camera trajectories. We further evaluate existing defenses and show that they are insufficient to defend against PI3D.

193. 【2602.07101】Zero-Shot UAV Navigation in Forests via Relightable 3D Gaussian Splatting

Link: https://arxiv.org/abs/2602.07101

Authors: Zinan Lv, Yeqian Qian, Chen Sang, Hao Liu, Danping Zou, Ming Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: passive monocular vision, substantial visual domain, visual domain gap, Gaussian Splatting enables, Gaussian Splatting

Comments: 12 pages, 8 figures

Abstract:UAV navigation in unstructured outdoor environments using passive monocular vision is hindered by the substantial visual domain gap between simulation and reality. While 3D Gaussian Splatting enables photorealistic scene reconstruction from real-world data, existing methods inherently couple static lighting with geometry, severely limiting policy generalization to dynamic real-world illumination. In this paper, we propose a novel end-to-end reinforcement learning framework designed for effective zero-shot transfer to unstructured outdoors. Within a high-fidelity simulation grounded in real-world data, our policy is trained to map raw monocular RGB observations directly to continuous control commands. To overcome photometric limitations, we introduce Relightable 3D Gaussian Splatting, which decomposes scene components to enable explicit, physically grounded editing of environmental lighting within the neural representation. By augmenting training with diverse synthesized lighting conditions ranging from strong directional sunlight to diffuse overcast skies, we compel the policy to learn robust, illumination-invariant visual features. Extensive real-world experiments demonstrate that a lightweight quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s, exhibiting significant resilience to drastic lighting variations without fine-tuning.

194. 【2602.07100】TLC-Plan: A Two-Level Codebook Based Network for End-to-End Vector Floorplan Generation

Link: https://arxiv.org/abs/2602.07100

Authors: Biao Xiong, Zhen Peng, Ping Wang, Qiegen Liu, Xian Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise geometric detail, Automated floorplan generation, improve design quality, jointly modeling global, Automated floorplan

Abstract:Automated floorplan generation aims to improve design quality, architectural efficiency, and sustainability by jointly modeling global spatial organization and precise geometric detail. However, existing approaches operate in raster space and rely on post hoc vectorization, which introduces structural inconsistencies and hinders end-to-end learning. Motivated by compositional spatial reasoning, we propose TLC-Plan, a hierarchical generative model that directly synthesizes vector floorplans from input boundaries, aligning with human architectural workflows based on modular and reusable patterns. TLC-Plan employs a two-level VQ-VAE to encode global layouts as semantically labeled room bounding boxes and to refine local geometries using polygon-level codes. This hierarchy is unified in a CodeTree representation, while an autoregressive transformer samples codes conditioned on the boundary to generate diverse and topologically valid designs, without requiring explicit room topology or dimensional priors. Extensive experiments show state-of-the-art performance on the RPLAN dataset (FID = 1.84, MSE = 2.06) and leading results on the LIFULL dataset. The proposed framework advances constraint-aware and scalable vector floorplan generation for real-world architectural applications. Source code and trained models are released at this https URL.
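The quantization step at the heart of any VQ-VAE, including a two-level one, is a nearest-neighbor lookup in a learned codebook. The sketch below shows only that lookup, applied once to a coarse feature and once to a fine feature. The codebooks, feature vectors, and the name `nearest_code` are hypothetical, and this is not the TLC-Plan architecture or its CodeTree representation.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


# Hypothetical codebooks: one for coarse layout features, one for local detail.
layout_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
detail_codebook = [[-0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]

coarse_feature = [0.9, 1.2]   # stand-in for an encoded room bounding box
fine_feature = [0.18, 0.22]   # stand-in for an encoded polygon refinement

coarse_id = nearest_code(coarse_feature, layout_codebook)
fine_id = nearest_code(fine_feature, detail_codebook)
```

Once features are replaced by discrete indices like these, a sequence model such as an autoregressive transformer can be trained to sample the indices, which matches the role the abstract assigns to its transformer.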

195. 【2602.07095】WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark

Link: https://arxiv.org/abs/2602.07095

Authors: Wang Lin, Feng Wang, Majun Zhang, Wentao Hu, Tao Jin, Zhou Zhao, Fei Wu, Jingyuan Chen, Alan Yuille, Sucheng Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated remarkable capabilities, Recent advances, style transfer, executing explicit instructions, attribute manipulation

Abstract:Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide WorldEdit-Test for evaluating existing models' performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, integrated with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.

196. 【2602.07082】MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation

Link: https://arxiv.org/abs/2602.07082

Authors: Haoming Wang, Qiyao Xue, Weichen Liu, Wei Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: traditional object detection, VLM spatial reasoning, spatial reasoning, actuation planning, traditional object

Abstract:As embodied AI expands from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning over video inputs is necessary to perceive the spatial relationships of objects and guide device actions. However, existing visual language models (VLMs) have very weak spatial reasoning capabilities due to their lack of knowledge about 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances an on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified spatial representation in the form of a global semantic map, and to guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, across reasoning tasks of diverse types and complexities.

197. 【2602.07081】Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

Link: https://arxiv.org/abs/2602.07081

Authors: Thu Hang Phung, Duong M. Nguyen, Thanh Trung Huynh, Quoc Viet Hung Nguyen, Trong Nghia Hoang, Phi Le Nguyen

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generalized federated prompt-tuning, input level, practical scenarios, scenarios where local, distributional patterns

Abstract:This paper introduces a generalized federated prompt-tuning framework for practical scenarios where local datasets are multimodal and exhibit different distributional patterns of missing features at the input level. The proposed framework bridges the gap between federated learning and multimodal prompt-tuning, which have traditionally focused on either unimodal or centralized data. A key challenge in this setting arises from the lack of semantic alignment between prompt instructions that encode similar distributional patterns of missing data across different clients. To address this, our framework introduces specialized client-tuning and server-aggregation designs that simultaneously optimize, align, and aggregate prompt-tuning instructions across clients and data modalities. This allows prompt instructions to complement one another and be combined effectively. Extensive evaluations on diverse multimodal benchmark datasets demonstrate that our work consistently outperforms state-of-the-art (SOTA) baselines.

198. 【2602.07069】Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2602.07069

Authors: Zihao Fan, Xin Lu, Yidi Liu, Jie Huang, Dong Li, Xueyang Fu, Zheng-Jun Zha

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: synthesize rich details, Diffusion-based super-resolution, synthetic paired data, rich details, synthesize rich

Abstract:Diffusion-based super-resolution can synthesize rich details, but models trained on synthetic paired data often fail on real-world LR images due to distribution shifts. We propose Bird-SR, a bidirectional reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images. For structural fidelity easily affected in ReFL, the model is directly optimized on synthetic pairs at early diffusion steps, which also facilitates structure preservation for real-world inputs under smaller distribution gap in structure levels. For perceptual enhancement, quality-guided rewards are applied at later sampling steps to both synthetic and real LR images. To mitigate reward hacking, the rewards for synthetic results are formulated in a relative advantage space bounded by their clean counterparts, while real-world optimization is regularized via a semantic alignment constraint. Furthermore, to balance structural and perceptual learning, we adopt a dynamic fidelity-perception weighting strategy that emphasizes structure preservation at early stages and progressively shifts focus toward perceptual optimization at later diffusion steps. Extensive experiments on real-world SR benchmarks demonstrate that Bird-SR consistently outperforms state-of-the-art methods in perceptual quality while preserving structural consistency, validating its effectiveness for real-world super-resolution.

199. 【2602.07065】Contactless estimation of continuum displacement and mechanical compressibility from image series using a deep learning based framework

Link: https://arxiv.org/abs/2602.07065

Authors: A.N. Maria Antony (1), T. Richter (2), E. Gladilin (1) ((1) Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), Seeland, Germany, (2) Otto-von-Guericke Universität, Magdeburg, Germany)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: direct physical measurements, Finite Element Method, Finite Difference Method, Contactless and non-invasive, physical media

Comments: 14 pages, 8 figures. Note: Supplementary information (ancillary file) attached as .pdf

Abstract:Contactless and non-invasive estimation of mechanical properties of physical media from optical observations is of interest for manifold engineering and biomedical applications, where direct physical measurements are not possible. Conventional approaches to the assessment of image displacement and non-contact material probing typically rely on time-consuming iterative algorithms for non-rigid image registration and constitutive modelling using discretization and iterative numerical solving techniques, such as Finite Element Method (FEM) and Finite Difference Method (FDM), which are not suitable for high-throughput data processing. Here, we present an efficient deep learning based end-to-end approach for the estimation of continuum displacement and material compressibility directly from the image series. Based on two deep neural networks for image registration and material compressibility estimation, this framework outperforms conventional approaches in terms of efficiency and accuracy. In particular, our experimental results show that the deep learning model trained on a set of reference data can accurately determine the material compressibility even in the presence of substantial local deviations of the mapping predicted by image registration from the reference displacement field. Our findings suggest that the remarkable accuracy of the deep learning end-to-end model originates from its ability to assess higher-order cognitive features, such as the vorticity of the vector field, rather than conventional local features of the image displacement.

200. 【2602.07064】Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

Link: https://arxiv.org/abs/2602.07064

Authors: Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding remains brittle, Physical understanding remains, remains brittle, visually ambiguous, ambiguous and sparsely

Abstract:Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction-image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio-visual consistency filtering to generate high-fidelity video-instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.

201. 【2602.07063】Video-based Music Generation

Link: https://arxiv.org/abs/2602.07063

Authors: Serkan Sulun

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

Keywords: internet grows rapidly, suitable soundtrack remains, grows rapidly, finding a suitable, significant challenge

Comments: PhD thesis, University of Porto

Abstract:As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By leveraging pretrained deep neural networks for feature extraction and keeping them frozen while training only fusion layers, we reduce computational complexity while improving accuracy. We show the generalization abilities of our method by obtaining state-of-the-art results on Ekman-6 and MovieNet. Another key contribution is a large-scale, emotion-labeled MIDI dataset for affective music generation. We then present an emotion-based MIDI generator, the first to condition on continuous emotional values rather than discrete categories, enabling nuanced music generation aligned with complex emotional content. To enhance temporal synchronization, we introduce a novel temporal boundary conditioning method, called "boundary offset encodings," aligning musical chords with scene changes. Combining video emotion classification, emotion-based music generation, and temporal boundary conditioning, EMSYNC emerges as a fully automatic video-based music generator. User studies show that it consistently outperforms existing methods in terms of music richness, emotional alignment, temporal synchronization, and overall preference, setting a new state-of-the-art in video-based music generation.

202. 【2602.07062】From Images to Decisions: Assistive Computer Vision for Non-Metallic Content Estimation in Scrap Metal

Link: https://arxiv.org/abs/2602.07062

Authors: Daniil Storonkin, Ilia Dziub, Maksim Golyadkin, Ilya Makarov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quality directly affects, directly affects energy, Scrap quality directly, quality directly, directly affects

Comments: AAAI 2026 Workshop on Addressing Challenges and Opportunities in Human-Centric Manufacturing

Abstract:Scrap quality directly affects energy use, emissions, and safety in steelmaking. Today, the share of non-metallic inclusions (contamination) is judged visually by inspectors, an approach that is subjective and hazardous due to dust and moving machinery. We present an assistive computer vision pipeline that estimates contamination (in percent) from images captured during railcar unloading and also classifies scrap type. The method formulates contamination assessment as a regression task at the railcar level and leverages sequential data through multi-instance learning (MIL) and multi-task learning (MTL). The best results are an MAE of 0.27 and an R² of 0.83 with MIL, while an MTL setup reaches an MAE of 0.36 with an F1 of 0.79 for scrap class. We also present the system running in near real time within the acceptance workflow: magnet/railcar detection segments temporal layers, a versioned inference service produces railcar-level estimates with confidence scores, and results are reviewed by operators with structured overrides; corrections and uncertain cases feed an active-learning loop for continual improvement. The pipeline reduces subjective variability, improves human safety, and enables integration into acceptance and melt-planning workflows.
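
For readers unfamiliar with the two regression metrics quoted in this abstract, the following is a minimal sketch of how MAE and R² are conventionally computed. The function names and the sample values are illustrative assumptions; the abstract does not describe the authors' evaluation code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical contamination percentages for four railcars (made-up numbers).
true_pct = [1.0, 2.5, 0.5, 4.0]
pred_pct = [1.2, 2.0, 0.7, 3.6]
print(mean_absolute_error(true_pct, pred_pct), r2_score(true_pct, pred_pct))
```

An R² of 1 means the predictions explain all of the variance in the targets, while a model that always predicts the mean target scores 0.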

203. 【2602.07058】FADE: Selective Forgetting via Sparse LoRA and Self-Distillation

Link: https://arxiv.org/abs/2602.07058

Authors: Carolina R. Kelsch, Leonardo S. B. Pereira, Natnael Mola, Luis H. Arribas, Juan C. S. M. Avedillo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Machine Unlearning aims, capability increasingly required, data protection regulations, Machine Unlearning, aims to remove

Abstract:Machine Unlearning aims to remove the influence of specific data or concepts from trained models while preserving overall performance, a capability increasingly required by data protection regulations and responsible AI practices. Despite recent progress, unlearning in text-to-image diffusion models remains challenging due to high computational costs and the difficulty of balancing effective forgetting with retention of unrelated concepts. We introduce FADE (Fast Adapter for Data Erasure), a two-stage unlearning method for image generation that combines parameter localization with self-distillation. FADE first identifies parameters most responsible for the forget set using gradient-based saliency and constrains updates through sparse LoRA adapters, ensuring lightweight, localized modifications. In a second stage, FADE applies a self-distillation objective that overwrites the forgotten concept with a user-defined surrogate while preserving behavior on retained data. The resulting adapters are memory-efficient, reversible, and can be merged or removed at runtime, enabling flexible deployment in production systems. We evaluated FADE on the UnlearnCanvas benchmark and conducted ablation studies on Imagenette, Labeled Faces in the Wild, AtharvaTaras Dog Breeds Dataset, and SUN Attributes datasets, demonstrating State-of-the-Art unlearning performance with fine-grained control over the forgetting-retention trade-off. Our results demonstrate that FADE achieves strong concept erasure and high retainability across various domains, making it a suitable solution for selective unlearning in diffusion-based image generation models.

204. 【2602.07057】RECITYGEN -- Interactive and Generative Participatory Urban Design Tool with Latent Diffusion and Segment Anything

Link: https://arxiv.org/abs/2602.07057

Authors: Di Mo, Mingyang Sun, Chengxiu Yin, Runjia Tian, Yanhong Wu, Liyan Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: impacts public spaces, profoundly impacts public, design profoundly impacts, City Information Modelling, community engagement

Abstract:Urban design profoundly impacts public spaces and community engagement. Traditional top-down methods often overlook public input, creating a gap in design aspirations and reality. Recent advancements in digital tools, like City Information Modelling and augmented reality, have enabled a more participatory process involving more stakeholders in urban design. Further, deep learning and latent diffusion models have lowered barriers for design generation, providing even more opportunities for participatory urban design. Combining state-of-the-art latent diffusion models with interactive semantic segmentation, we propose RECITYGEN, a novel tool that allows users to interactively create variational street view images of urban environments using text prompts. In a pilot project in Beijing, users employed RECITYGEN to suggest improvements for an ongoing Urban Regeneration project. Despite some limitations, RECITYGEN has shown significant potential in aligning with public preferences, indicating a shift towards more dynamic and inclusive urban planning methods. The source code for the project can be found at RECITYGEN GitHub.

205. 【2602.07054】AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

Link: https://arxiv.org/abs/2602.07054

Authors: Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: socially intelligent agents, building socially intelligent, intelligent agents, essential for building, building socially

Comments: Accepted as a conference paper at ICLR 2026. Project page: [this https URL](https://avere-iclr.github.io)

Abstract:Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain - spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS and EMER demonstrate that our method significantly improves the performance of the reference baseline models with 6-19% of relative performance gains in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models and benchmark will be released at this https URL.

206. 【2602.07052】Toward Accurate and Accessible Markerless Neuronavigation

Link: https://arxiv.org/abs/2602.07052

Authors: Ziye Xie, Oded Schlesinger, Raj Kundu, Jessica Y. Choi, Pablo Iturralde, Dennis A. Turner, Stefan M. Goetz, Guillermo Sapiro, Angel V. Peterchev, J. Matias Di Martino

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: transcranial magnetic stimulation, interventions to guide, guide the precise, precise placement, placement of instruments

Abstract:Neuronavigation is widely used in biomedical research and interventions to guide the precise placement of instruments around the head to support procedures such as transcranial magnetic stimulation. Traditional systems, however, rely on subject-mounted markers that require manual registration, may shift during procedures, and can cause discomfort. We introduce and evaluate markerless approaches that replace expensive hardware and physical markers with low-cost visible and infrared light cameras incorporating stereo and depth sensing combined with algorithmic modeling of the facial geometry. Validation with 50 human subjects yielded a median tracking discrepancy of only 2.32 mm and 2.01° for the best markerless algorithms compared to a conventional marker-based system, which indicates sufficient accuracy for transcranial magnetic stimulation and a substantial improvement over prior markerless results. The results suggest that integration of the data from the various camera sensors can improve the overall accuracy further. The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and expand access to neuronavigation in clinical and research settings.

207. 【2602.07051】Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning

Link: https://arxiv.org/abs/2602.07051

Authors: Karthik Sivakoti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: separate Optical Character, Optical Character Recognition, Automatic License Plate, License Plate Recognition, Traditional Automatic License

Abstract:Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, which is a 14.1% improvement over EasyOCR and a 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision-language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
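
The abstract reports an Expected Calibration Error without saying how it was computed. As a rough illustration, the sketch below uses the common equal-width-bin definition of ECE. The function name, the bin count of 10, and the toy inputs are assumptions for illustration, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    For each bin, take |accuracy - mean confidence| and weight it by the
    fraction of predictions that fall into that bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy example: an overconfident model (made-up numbers).
print(expected_calibration_error([0.95, 0.95, 0.55, 0.55], [True, False, True, False]))
```

A value near 0 means the stated confidences match the observed accuracy closely, which is the sense in which a low ECE is read as good calibration.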

208. 【2602.07050】Interpreting Physics in Video World Models

Link: https://arxiv.org/abs/2602.07050

Authors: Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, Mike Rabbat

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: make physically accurate, Physics Emergence Zone, physically accurate predictions, Emergence Zone, long-standing question

Abstract:A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition, which we call the Physics Emergence Zone, at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.

209. 【2602.07049】Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead

Link: https://arxiv.org/abs/2602.07049

Authors: Jindong Li, Dario Zanca, Vincent Christlein, Tim Hamann, Jens Barth, Peter Kämpf, Björn Eskofier

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Online handwriting recognition, inertial measurement units, measurement units opens, Online handwriting, digital devices

Abstract:Online handwriting recognition using inertial measurement units opens up handwriting on paper as input for digital devices. Doing it on edge hardware improves privacy and lowers latency, but entails memory constraints. To address this, we propose Error-enhanced Contrastive Handwriting Recognition (ECHWR), a training framework designed to improve feature representation and recognition accuracy without increasing inference costs. ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase. This alignment is maintained through a dual contrastive objective: an in-batch contrastive loss for general modality alignment and a novel error-based contrastive loss that distinguishes between correct signals and synthetic hard negatives. The auxiliary branch is discarded after training, which allows the deployed model to keep its original, efficient architecture. Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines, reducing character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split. Finally, although our ablation studies indicate that solving specific challenges requires specific architectural and objective configurations, error-based contrastive loss shows its effectiveness for handling unseen writing styles.
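
To make the "in-batch contrastive loss" mentioned in this abstract more concrete, here is a minimal sketch of a generic one-directional InfoNCE-style loss, where each sensor embedding should match the text embedding at the same batch index. This is a standard formulation, not necessarily the one used in ECHWR; the function names, the temperature of 0.1, and the toy vectors are assumptions. The error-based loss with synthetic hard negatives is not shown.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_contrastive_loss(signal_emb, text_emb, temperature=0.1):
    """Mean cross-entropy of matching signal i to text i against all texts in the batch."""
    signals = [_normalize(v) for v in signal_emb]
    texts = [_normalize(v) for v in text_emb]
    loss = 0.0
    for i, sig in enumerate(signals):
        # Cosine similarity to every text embedding, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(sig, txt)) / temperature for txt in texts]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(signals)


# Toy batch of two pairs (made-up 2-D embeddings).
signals = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(in_batch_contrastive_loss(signals, aligned_texts))
print(in_batch_contrastive_loss(signals, swapped_texts))
```

The loss is small when matching pairs are the most similar items in the batch and grows when a mismatched text is closer than the correct one. Because such a loss only shapes the embeddings during training, the branch that produces the text embeddings can be dropped at inference time, which matches the zero-overhead claim in the abstract.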

210. 【2602.07047】ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees

链接https://arxiv.org/abs/2602.07047

作者:Muhammad Rashid,Elvio G. Amparore,Enrico Ferrari,Damiano Verda

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Computer Vision, influence model predictions, Pixel-level feature attributions, Pixel-level feature, Computer Vision tasks

备注: AAAI-2026

点击查看摘要

Abstract:Pixel-level feature attributions are an important tool in eXplainable AI for Computer Vision (XCV), providing visual insights into how image features influence model predictions. The Owen formula for hierarchical Shapley values has been widely used to interpret machine learning (ML) models and their learned representations. However, existing hierarchical Shapley approaches do not exploit the multiscale structure of image data, leading to slow convergence and weak alignment with the actual morphological features. Moreover, no prior Shapley method has leveraged data-aware hierarchies for Computer Vision tasks, leaving a gap in model interpretability of structured visual data. To address this, this paper introduces ShapBPT, a novel data-aware XCV method based on the hierarchical Shapley formula. ShapBPT assigns Shapley coefficients to a multiscale hierarchical structure tailored for images, the Binary Partition Tree (BPT). By using this data-aware hierarchical partitioning, ShapBPT ensures that feature attributions align with intrinsic image morphology, effectively prioritizing relevant regions while reducing computational overhead. This advancement connects hierarchical Shapley methods with image data, providing a more efficient and semantically meaningful approach to visual interpretability. Experimental results confirm ShapBPT's effectiveness, demonstrating superior alignment with image structures and improved efficiency over existing XCV methods, and a 20-subject user study confirming that ShapBPT explanations are preferred by humans.

211. 【2602.07045】VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

链接https://arxiv.org/abs/2602.07045

作者:Zhiming Luo,Di Wang,Haonan Guo,Jing Zhang,Bo Du

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Multimodal Large Language, Recent advancements, Language Models, Large Language

备注

点击查看摘要

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.

212. 【2602.07044】PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

链接https://arxiv.org/abs/2602.07044

作者:Tianyi Qu,Songxiao Yang,Haolin Wang,Huadong Song,Xiaoting Guo,Wenguang Hu,Guanlin Liu,Honghe Chen,Yafei Ou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Magnetic Flux Leakage, Flux Leakage, Magnetic Flux, non-destructive testing technology, primary non-destructive testing

备注: A dataset contains 240,320 pipeline MFL pseudo-color images and 191,530 bounding-box annotations, collected from 11 pipelines spanning approximately 1,480 km

点击查看摘要

Abstract:Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 240,320 images and 191,530 high-quality bounding-box annotations, collected from 11 pipelines spanning approximately 1,480 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

213. 【2602.07042】COMBOOD: A Semiparametric Approach for Detecting Out-of-distribution Data for Image Classification

链接https://arxiv.org/abs/2602.07042

作者:Magesh Rajasekaran,Md Saiful Islam Sajol,Frej Berglind,Supratik Mukhopadhyay,Kamalika Das

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:machine learning applications, OOD detection, OOD, COMBOOD framework, COMBOOD

备注: Copyright by SIAM. Unauthorized reproduction of this article is prohibited First Published in Proceedings of the 2024 SIAM International Conference on Data Mining (SDM24), published by the Society for Industrial and Applied Mathematics (SIAM)

点击查看摘要

Abstract:Identifying out-of-distribution (OOD) data at inference time is crucial for many machine learning applications, especially for automation. We present a novel unsupervised semi-parametric framework COMBOOD for OOD detection with respect to image recognition. Our framework combines signals from two distance metrics, nearest-neighbor and Mahalanobis, to derive a confidence score for an inference point to be out-of-distribution. The former provides a non-parametric approach to OOD detection. The latter provides a parametric, simple, yet effective method for detecting OOD data points, especially, in the far OOD scenario, where the inference point is far apart from the training data set in the embedding space. However, its performance is not satisfactory in the near OOD scenarios that arise in practical situations. Our COMBOOD framework combines the two signals in a semi-parametric setting to provide a confidence score that is accurate both for the near-OOD and far-OOD scenarios. We show experimental results with the COMBOOD framework for different types of feature extraction strategies. We demonstrate experimentally that COMBOOD outperforms state-of-the-art OOD detection methods on the OpenOOD (both version 1 and most recent version 1.5) benchmark datasets (for both far-OOD and near-OOD) as well as on the documents dataset in terms of accuracy. On a majority of the benchmark datasets, the improvements in accuracy resulting from the COMBOOD framework are statistically significant. COMBOOD scales linearly with the size of the embedding space, making it ideal for many real-life applications.
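
The abstract names the two signals being combined (a nearest-neighbor distance and a Mahalanobis distance) but not the combination rule. The sketch below is therefore only a toy illustration of the general idea. It is not the COMBOOD method: it assumes a single Gaussian with a diagonal covariance for the Mahalanobis term and a fixed weighted sum as a placeholder combination, and all function names and data are made up.

```python
import math


def knn_distance(x, train, k=1):
    """Euclidean distance from x to its k-th nearest training embedding (non-parametric signal)."""
    distances = sorted(math.dist(x, t) for t in train)
    return distances[k - 1]


def diagonal_mahalanobis(x, train, eps=1e-6):
    """Mahalanobis distance to the training mean, assuming a diagonal covariance (parametric signal)."""
    n, dim = len(train), len(x)
    mean = [sum(t[d] for t in train) / n for d in range(dim)]
    var = [sum((t[d] - mean[d]) ** 2 for t in train) / n + eps for d in range(dim)]
    return math.sqrt(sum((x[d] - mean[d]) ** 2 / var[d] for d in range(dim)))


def combined_ood_score(x, train, k=1, weight=0.5):
    """Higher means more likely out-of-distribution. The weighted sum is a placeholder rule."""
    return weight * knn_distance(x, train, k) + (1 - weight) * diagonal_mahalanobis(x, train)


# Toy 2-D "embeddings" (made-up numbers).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(combined_ood_score((0.5, 0.5), train))    # point inside the training cloud
print(combined_ood_score((10.0, 10.0), train))  # point far from the training data
```

In practice the two distances live on different scales, so a real system would likely calibrate or learn the combination rather than fix the weight by hand.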

214. 【2602.07041】OMNI-Dent: Towards an Accessible and Explainable AI Framework for Automated Dental Diagnosis

链接https://arxiv.org/abs/2602.07041

作者:Leeje Jang,Yao-Yi Chiang,Angela M. Hastings,Patimaporn Pungchanchaikul,Martha B. Lucas,Emily C. Schultz,Jeffrey P. Louie,Mohamed Estai,Wen-Chen Wang,Ryan H.L. Ip,Boyen Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Accurate dental diagnosis, Accurate dental, oral healthcare, essential for oral, timely professional evaluation

备注

点击查看摘要

Abstract:Accurate dental diagnosis is essential for oral healthcare, yet many individuals lack access to timely professional evaluation. Existing AI-based methods primarily treat diagnosis as a visual pattern recognition task and do not reflect the structured clinical reasoning used by dental professionals. These approaches also require large amounts of expert-annotated data and often struggle to generalize across diverse real-world imaging conditions. To address these limitations, we present OMNI-Dent, a data-efficient and explainable diagnostic framework that incorporates clinical reasoning principles into a Vision-Language Model (VLM)-based pipeline. The framework operates on multi-view smartphone photographs, embeds diagnostic heuristics from dental experts, and guides a general-purpose VLM to perform tooth-level evaluation without dental-specific fine-tuning of the VLM. By utilizing the VLM's existing visual-linguistic capabilities, OMNI-Dent aims to support diagnostic assessment in settings where curated clinical imaging is unavailable. Designed as an early-stage assistive tool, OMNI-Dent helps users identify potential abnormalities and determine when professional evaluation may be needed, offering a practical option for individuals with limited access to in-person care.

215. 【2602.07038】UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

Link: https://arxiv.org/abs/2602.07038

Authors: Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, Maosong Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: remains challenging due, Key Information Extraction, Large Multimodal Models, task-specific information requirements, real-world documents remains

Abstract:Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All code and datasets are available at this https URL.

216. 【2602.07037】Stochastic Spiking Neuron Based SNN Can be Inherently Bayesian

Link: https://arxiv.org/abs/2602.07037

Authors: Huannan Zheng, Jingli Liu, Kezhou Yang

Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: Magnetic Tunnel Junctions, computationally beneficial, biological neural systems, Tunnel Junctions, Bayesian neural network

Abstract:Uncertainty in biological neural systems appears to be computationally beneficial rather than detrimental. However, in neuromorphic computing systems, device variability often limits performance, including accuracy and efficiency. In this work, we propose a spiking Bayesian neural network (SBNN) framework that unifies the dynamic models of intrinsic device stochasticity (based on Magnetic Tunnel Junctions) and stochastic threshold neurons to leverage noise as a functional Bayesian resource. Experiments demonstrate that SBNN achieves high accuracy (99.16% on MNIST, 94.84% on CIFAR10) with 8-bit precision. Meanwhile, the rate estimation method provides a ~20-fold training speedup. Furthermore, SBNN exhibits superior robustness, showing a 67% accuracy improvement under synaptic weight noise and 12% under input noise compared to standard spiking neural networks. Crucially, hardware validation confirms that physical device implementation causes imperceptible accuracy and calibration loss compared to the algorithmic model. Converting device stochasticity into neuronal uncertainty offers a route to compact, energy-efficient neuromorphic computing under uncertainty.

217. 【2602.07028】A Comparative Study of Adversarial Robustness in CNN and CNN-ANFIS Architectures

Link: https://arxiv.org/abs/2602.07028

Authors: Kaaustaaub Shankar, Bharadwaj Dogga, Kelly Cohen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, achieve strong image, strong image classification

Comments: Accepted to NAFIPS 2026

Abstract:Convolutional Neural Networks (CNNs) achieve strong image classification performance but lack interpretability and are vulnerable to adversarial attacks. Neuro-fuzzy hybrids such as DCNFIS replace fully connected CNN classifiers with Adaptive Neuro-Fuzzy Inference Systems (ANFIS) to improve interpretability, yet their robustness remains underexplored. This work compares standard CNNs (ConvNet, VGG, ResNet18) with their ANFIS-augmented counterparts on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 under gradient-based (PGD) and gradient-free (Square) attacks. Results show that ANFIS integration does not consistently improve clean accuracy and has architecture-dependent effects on robustness: ResNet18-ANFIS exhibits improved adversarial robustness, while VGG-ANFIS often underperforms its baseline. These findings suggest that neuro-fuzzy augmentation can enhance robustness in specific architectures but is not universally beneficial.

218. 【2602.07027】Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models

Link: https://arxiv.org/abs/2602.07027

Authors: Sanggeon Yun, Ryozo Masukawa, SungHeon Jeong, Wenjun Huang, Hanning Chen, Mohsen Imani

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: strong zero-shot recognition, suffer substantial degradation, CLIP enable strong, Vision-Language Models, enable strong zero-shot

Abstract:Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization -- an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.

219. 【2602.07026】Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Link: https://arxiv.org/abs/2602.07026

Authors: Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Xiaoxing Hu, Xinlei Yu, Ziyue Qiao, Hao Tang, Xue Yang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: systematically offset regions, distinct modalities expressing, modalities expressing identical, expressing identical semantics, identical semantics occupy

Abstract:Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
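
The simplest part of a modality gap, a constant offset between the two embedding clouds, can be corrected using only per-modality statistics. That is why unpaired data suffices for this kind of alignment. The sketch below shows only such a mean shift. It is not the ReAlign procedure: the abstract names Anchor, Trace, and Centroid Alignment steps without defining them, and the function names here are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def shift_to_image_centroid(text_embs, image_embs):
    """Translate text embeddings so that their mean coincides with the image mean.

    Only first-order statistics are used, so the two sets need not be paired.
    Anisotropic (direction-dependent) residuals are left untouched.
    """
    mu_text, mu_image = centroid(text_embs), centroid(image_embs)
    offset = [i - t for i, t in zip(mu_image, mu_text)]
    return [[x + o for x, o in zip(v, offset)] for v in text_embs]

# Toy unpaired embeddings; real ones would come from a contrastive encoder.
text = [[0.0, 1.0], [2.0, 3.0]]
image = [[10.0, 10.0], [12.0, 14.0]]
aligned = shift_to_image_centroid(text, image)
print(centroid(aligned), centroid(image))
```

After the shift, the two centroids agree while the internal geometry of the text embeddings is unchanged. Any remaining mismatch is in the shape of the distributions, which the abstract attributes to anisotropic residuals.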

220. 【2602.07025】The Geometry of Representational Failures in Vision Language Models

Link: https://arxiv.org/abs/2602.07025

Authors: Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: hallucinating non-existent elements, exhibit puzzling failures, exhibit puzzling, objects among distractions, hallucinating non-existent

Abstract:Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors" - latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.
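
As a rough illustration of the three ingredients mentioned above (concept vectors, their geometric overlap, and steering), the sketch below uses a difference-of-means direction and cosine similarity. This is one common way to obtain such directions and is an assumption here. The abstract says several distillation methodologies are compared but does not list them, and the toy activations are invented for the example.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of activation vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def concept_vector(with_concept, without_concept):
    """Difference-of-means direction between activations with and without a concept."""
    return [a - b for a, b in zip(mean_vector(with_concept), mean_vector(without_concept))]

def cosine(u, v):
    """Cosine similarity, used here as a simple measure of geometric overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def steer(hidden, direction, strength):
    """Steering intervention: push a hidden state along a concept direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]

# Invented 3-D activations for two concepts and a shared baseline.
baseline = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
red = concept_vector([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]], baseline)
blue = concept_vector([[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]], baseline)
print(cosine(red, blue))
print(steer([0.0, 0.0, 0.0], blue, 0.5))
```

Orthogonal directions, as in this toy case, have zero overlap. The abstract's observation is that larger overlap between concept vectors correlates with specific error patterns.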

221. 【2602.07024】A Distributed Multi-Modal Sensing Approach for Human Activity Recognition in Real-Time Human-Robot Collaboration

Link: https://arxiv.org/abs/2602.07024

Authors: Valerio Belcamino, Nhat Minh Dinh Le, Quan Khanh Luu, Alessandro Carfì, Van Anh Ho, Fulvio Mastrogiovanni

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Human activity recognition, Inertial Measurement Units, human intentions, human-robot collaboration, fundamental in human-robot

Abstract:Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond to and dynamically adapt to human intentions. This paper introduces a HAR system combining a modular data glove equipped with Inertial Measurement Units and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under different conditions, including offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show a high accuracy for all the tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.

222. 【2602.07019】Deep Learning Based Multi-Level Classification for Aviation Safety

Link: https://arxiv.org/abs/2602.07019

Authors: Elaheh Sabziyan Varnousfaderani, Syed A. M. Shihab, Jonathan King

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: substantial financial costs, Bird strikes pose, severe aircraft damage, loss of life, severe aircraft

Abstract:Bird strikes pose a significant threat to aviation safety, often resulting in loss of life, severe aircraft damage, and substantial financial costs. Existing bird strike prevention strategies primarily rely on avian radar systems that detect and track birds in real time. A major limitation of these systems is their inability to identify bird species, an essential factor, as different species exhibit distinct flight behaviors and altitudinal preferences. To address this challenge, we propose an image-based bird classification framework using Convolutional Neural Networks (CNNs), designed to work with camera systems for autonomous visual detection. The CNN is designed to identify bird species and provide critical input to species-specific predictive models for accurate flight path prediction. In addition to species identification, we implemented dedicated CNN classifiers to estimate flock formation type and flock size. These characteristics provide valuable supplementary information for aviation safety. Specifically, flock type and size offer insights into collective flight behavior and trajectory dispersion. Flock size directly relates to the potential impact severity, as the overall damage risk increases with the combined kinetic energy of multiple birds.
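
The link between flock size and impact severity follows from the standard kinetic energy formula. In the expression below, the symbols are assumptions introduced for illustration and do not come from the paper: $N$ is the flock size, $m_i$ the mass of bird $i$, and $v_i$ its speed relative to the aircraft.

```latex
\begin{aligned}
E_{\text{flock}} &= \sum_{i=1}^{N} \tfrac{1}{2}\, m_i v_i^{2}, \\
E_{\text{flock}} &= \tfrac{1}{2}\, N m v^{2}
  \quad \text{when } m_i = m \text{ and } v_i = v \text{ for all } i .
\end{aligned}
```

Under the equal-mass, equal-speed simplification the combined energy grows linearly with flock size, so an estimate of $N$ scales the worst-case energy directly. This worst case assumes every bird in the flock strikes the aircraft.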

223. 【2602.07017】XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models

Link: https://arxiv.org/abs/2602.07017

Authors: Thuraya Alzubaidi, Sana Ammar, Maryam Alsharqi, Islem Rekik, Muzammil Behzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enabling accurate diagnosis, Medical image segmentation, treatment planning, accurate diagnosis, disease monitoring

Abstract:Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts, highlighting that the incorporation of multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
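
For readers unfamiliar with the two overlap metrics reported above, the sketch below computes the Dice score and Intersection-over-Union (IoU) for a pair of binary masks using their standard definitions. The function name, the flat-list mask format, and the convention of returning 1.0 for two empty masks are assumptions for illustration, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two equal-length sequences of 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    # Convention (an assumption): two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou

# Flattened toy masks: one overlapping pixel out of two predicted and two true.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so the same change in overlap produces different relative gains in each. That is consistent with the differing percentage improvements quoted in the abstract.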

224. 【2602.07016】Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency

Link: https://arxiv.org/abs/2602.07016

Authors: Mohsen Mostafa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant visual ambiguity, unstructured image collections, image collections remains, multiple unrelated scenes, Image Matching Challenge

Comments: 10 pages, 3 figures, [this https URL](https://www.kaggle.com/code/babydriver1233/optimized-pipeline-for-the-image-matching-challeng), [this https URL](https://www.kaggle.com/code/babydriver1233/integrating-lejepa-for-enhanced-image-matching)

Abstract:Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (Joint Embedding Predictive Architecture) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.

225. 【2602.07015】Robust and Real-Time Bangladeshi Currency Recognition: A Dual-Stream MobileNet and EfficientNet Approach

Link: https://arxiv.org/abs/2602.07015

Authors: Subreena, Mohammad Amzad Hossain, Mirza Raquib, Saydul Akbar Murad, Farida Siddiqi Prity, Muhammad Hanif, Nick Rahimi

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Accurate currency recognition, visually impaired individuals, Accurate currency, assistive technologies, essential for assistive

Abstract:Accurate currency recognition is essential for assistive technologies, particularly for visually impaired individuals who rely on others to identify banknotes. This dependency puts them at risk of fraud and exploitation. To address these challenges, we first build a new Bangladeshi banknote dataset that includes both controlled and real-world scenarios, ensuring a more comprehensive and diverse representation. Next, to enhance the dataset's robustness, we incorporate four additional datasets, including public benchmarks, to cover various complexities and improve the model's generalization. To overcome the limitations of current recognition models, we propose a novel hybrid CNN architecture that combines MobileNetV3-Large and EfficientNetB0 for efficient feature extraction. This is followed by an effective multilayer perceptron (MLP) classifier to improve performance while keeping computational costs low, making the system suitable for resource-constrained devices. The experimental results show that the proposed model achieves 97.95% accuracy on controlled datasets, 92.84% on complex backgrounds, and 94.98% accuracy when combining all datasets. The model's performance is thoroughly evaluated using five-fold cross-validation and seven metrics: accuracy, precision, recall, F1-score, Cohen's Kappa, MCC, and AUC. Additionally, explainable AI methods like LIME and SHAP are incorporated to enhance transparency and interpretability.

226. 【2602.07014】Vectra: A New Metric, Dataset, and Model for Visual Quality Assessment in E-Commerce In-Image Machine Translation

Link: https://arxiv.org/abs/2602.07014

Authors: Qingyu Wu, Yuxuan Han, Haijun Li, Zhao Xu, Jianshan Zhao, Xu Jin, Longyue Wang, Weihua Luo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: In-Image Machine Translation, machine translation evaluation, Machine Translation, existing research focuses, powers cross-border e-commerce

Abstract:In-Image Machine Translation (IIMT) powers cross-border e-commerce product listings; existing research focuses on machine translation evaluation, while visual rendering quality is critical for user engagement. When facing context-dense product imagery and multimodal defects, current reference-based methods (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals. To bridge this gap, we introduce Vectra, to the best of our knowledge, the first reference-free, MLLM-driven visual quality assessment framework for e-commerce IIMT. Vectra comprises three components: (1) Vectra Score, a multidimensional quality metric system that decomposes visual quality into 14 interpretable dimensions, with spatially-aware Defect Area Ratio (DAR) quantification to reduce annotation ambiguity; (2) Vectra Dataset, constructed from 1.1M real-world product images via diversity-aware sampling, comprising a 2K benchmark for system evaluation, 30K reasoning-based annotations for instruction tuning, and 3.5K expert-labeled preferences for alignment and evaluation; and (3) Vectra Model, a 4B-parameter MLLM that generates both quantitative scores and diagnostic reasoning. Experiments demonstrate that Vectra achieves state-of-the-art correlation with human rankings, and our model outperforms leading MLLMs, including GPT-5 and Gemini-3, in scoring performance. The dataset and model will be released upon acceptance.

227. 【2602.07013】Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Link: https://arxiv.org/abs/2602.07013

Authors: Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Vision Language Models, safe model behavior, Language Models, model behavior, safe model

Abstract:With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore the challenges mentioned above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.

228. 【2602.07012】A General Model for Retinal Segmentation and Quantification

Link: https://arxiv.org/abs/2602.07012

Authors: Zhonghua Wang, Lie Ju, Sijia Li, Wei Feng, Sijin Zhou, Ming Hu, Jianhao Xiong, Xiaoying Tang, Yifan Peng, Mingquan Lin, Yaodong Ding, Yong Zeng, Wenbin Wei, Li Dong, Zongyuan Ge

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: systemic health assessment, offering quantifiable structural, offering quantifiable, health assessment, systemic health

Abstract:Retinal imaging is fast, non-invasive, and widely available, offering quantifiable structural and vascular signals for ophthalmic and systemic health assessment. This accessibility creates an opportunity to study how quantitative retinal phenotypes relate to ocular and systemic diseases. However, such analyses remain difficult at scale due to the limited availability of public multi-label datasets and the lack of a unified segmentation-to-quantification pipeline. We present RetSAM, a general retinal segmentation and quantification framework for fundus imaging. It delivers robust multi-target segmentation and standardized biomarker extraction, supporting downstream ophthalmologic studies and oculomics correlation analyses. Trained on over 200,000 fundus images, RetSAM supports three task categories and segments five anatomical structures, four retinal phenotypic patterns, and more than 20 distinct lesion types. It converts these segmentation results into over 30 standardized biomarkers that capture structural morphology, vascular geometry, and degenerative changes. Trained with a multi-stage strategy using both private and public fundus data, RetSAM achieves superior segmentation performance on 17 public datasets. It improves on prior best methods by 3.9 percentage points in DSC on average, with up to 15 percentage points on challenging multi-task benchmarks, and generalizes well across diverse populations, imaging devices, and clinical settings. The resulting biomarkers enable systematic correlation analyses across major ophthalmic diseases, including diabetic retinopathy, age-related macular degeneration, glaucoma, and pathologic myopia. Together, RetSAM transforms fundus images into standardized, interpretable quantitative phenotypes, enabling large-scale ophthalmic research and translation.

229. 【2602.07011】MAU-GPT: Enhancing Multi-type Industrial Anomaly Understanding via Anomaly-aware and Generalist Experts Adaptation

Link: https://arxiv.org/abs/2602.07011

Authors: Zhuonan Wang, Zhenxuan Fan, Siwen Tan, Yu Zhong, Yuqian Yuan, Haoyuan Li, Hao Jiang, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: automating fine-grained product, fine-grained product image, product image analysis, industrial manufacturing scales, manufacturing scales

Comments: 9 pages, 5 figures

Abstract:As industrial manufacturing scales, automating fine-grained product image analysis has become critical for quality control. However, existing approaches are hindered by limited dataset coverage and poor model generalization across diverse and complex anomaly patterns. To address these challenges, we introduce MAU-Set, a comprehensive dataset for Multi-type industrial Anomaly Understanding. It spans multiple industrial domains and features a hierarchical task structure, ranging from binary classification to complex reasoning. Alongside this dataset, we establish a rigorous evaluation protocol to facilitate fair and comprehensive model assessment. Building upon this foundation, we further present MAU-GPT, a domain-adapted multimodal large model specifically designed for industrial anomaly understanding. It incorporates a novel AMoE-LoRA mechanism that unifies anomaly-aware and generalist experts adaptation, enhancing both understanding and reasoning across diverse defect classes. Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains, demonstrating strong potential for scalable and automated industrial inspection.

230. 【2602.07008】Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

Link: https://arxiv.org/abs/2602.07008

Authors: Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Reliable models, predict correctly, Reliable, Human, human prior

Abstract:Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.

231. 【2602.07006】Scalable spatial point process models for forensic footwear analysis

Link: https://arxiv.org/abs/2602.07006

Authors: Alokesh Manna, Neil Spencer, Dipak K. Dey

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: crime scenes plays, plays a key, key role, print evidence recovered, Shoe

Abstract:Shoe print evidence recovered from crime scenes plays a key role in forensic investigations. By examining shoe prints, investigators can determine details of the footwear worn by suspects. However, establishing that a suspect's shoes match the make and model of a crime scene print may not be sufficient. Typically, thousands of shoes of the same size, make, and model are manufactured, any of which could be responsible for the print. Accordingly, a popular approach used by investigators is to examine the print for signs of "accidentals," i.e., cuts, scrapes, and other features that accumulate on shoe soles after purchase due to wear. While some patterns of accidentals are common on certain types of shoes, others are highly distinctive, potentially distinguishing the suspect's shoe from all others. Quantifying the rarity of a pattern is thus essential to accurately measuring the strength of forensic evidence. In this study, we address this task by developing a hierarchical Bayesian model. Our improvement over existing methods primarily stems from two advancements. First, we frame our approach in terms of a latent Gaussian model, thus enabling inference to be efficiently scaled to large collections of annotated shoe prints via integrated nested Laplace approximations. Second, we incorporate spatially varying coefficients to model the relationship between shoes' tread patterns and accidental locations. We demonstrate these improvements through superior performance on held-out data, which enhances accuracy and reliability in forensic shoe print analysis.

232. 【2602.06995】When Simultaneous Localization and Mapping Meets Wireless Communications: A Survey

链接https://arxiv.org/abs/2602.06995

作者:Konstantinos Gounis,Sotiris A. Tegos,Dimitrios Tyrovolas,Panagiotis D. Diamantoulakis,George K. Karagiannidis

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Multiagent Systems (cs.MA)

关键词:sensing equipment combined, commercial wireless communication, Wireless Communications, intelligent autonomous systems, autonomous systems paves

备注

点击查看摘要

Abstract:The availability of commercial wireless communication and sensing equipment combined with the advancements in intelligent autonomous systems paves the way towards robust joint communications and simultaneous localization and mapping (SLAM). This paper surveys the state-of-the-art in the nexus of SLAM and Wireless Communications, attributing the bidirectional impact of each with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition to this, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze mathematical approaches such as probabilistic models, and spatial methods for signal processing, as well as key technological aspects. We expose techniques and items towards enabling a highly effective retrieval of the autonomous robot state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF relevant information, as the latter can serve as a proxy for the scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry that is central in SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM are still in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.

233. 【2602.06991】LangGS-SLAM: Real-Time Language-Feature Gaussian Splatting SLAM

链接https://arxiv.org/abs/2602.06991

作者:Seongbo Ha,Sibaek Lee,Kyungsu Kang,Joonyeol Choi,Seungjun Tak,Hyeonwoo Yu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:sustaining low-latency tracking, RGB-D SLAM system, propose a RGB-D, RGB-D SLAM, tracking and mapping

备注: 17 pages, 4 figures

点击查看摘要

Abstract:In this paper, we propose an RGB-D SLAM system that reconstructs a language-aligned dense feature field while sustaining low-latency tracking and mapping. First, we introduce a Top-K Rendering pipeline, a high-throughput and semantic-distortion-free method for efficiently rendering high-dimensional feature maps. To address the resulting semantic-geometric discrepancy and mitigate the memory consumption, we further design a multi-criteria map management strategy that prunes redundant or inconsistent Gaussians while preserving scene integrity. Finally, a hybrid field optimization framework jointly refines the geometric and semantic fields under real-time constraints by decoupling their optimization frequencies according to field characteristics. The proposed system achieves superior geometric fidelity compared to geometric-only baselines and comparable semantic fidelity to offline approaches while operating at 15 FPS. Our results demonstrate that online SLAM with dense, uncompressed language-aligned feature fields is both feasible and effective, bridging the gap between 3D perception and language-based reasoning.

234. 【2602.06974】FeudalNav: A Simple Framework for Visual Navigation

链接https://arxiv.org/abs/2602.06974

作者:Faith Johnson,Bryan Bo Cao,Shubham Jain,Ashwin Ashok,Kristin Dana

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:detailed maps, robotics is inspired, ability to navigate, visual cues, navigate environments

备注: 8 Pages, 6 figures and 4 tables. arXiv admin note: substantial text overlap with [arXiv:2411.09893](https://arxiv.org/abs/2411.09893) , [arXiv:2402.12498](https://arxiv.org/abs/2402.12498)

点击查看摘要

Abstract:Visual navigation for robotics is inspired by the human ability to navigate environments using visual cues and memory, eliminating the need for detailed maps. In unseen, unmapped, or GPS-denied settings, traditional metric map-based methods fall short, prompting a shift toward learning-based approaches with minimal exploration. In this work, we develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. A key component of the approach is a latent-space memory module organized solely by visual similarity, as a proxy for distance. This alternative to graph-based topological representations proves sufficient for navigation tasks, providing a compact, light-weight, simple-to-train navigator that can find its way to the goal in novel locations. We show competitive results with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference. An additional contribution leverages the interpretability of the framework for interactive navigation. We consider the question: how much direction intervention/interaction is needed to achieve success in all trials? We demonstrate that even minimal human involvement can significantly enhance overall navigation performance.

235. 【2602.06968】Learning to Anchor Visual Odometry: KAN-Based Pose Regression for Planetary Landing

链接https://arxiv.org/abs/2602.06968

作者:Xubo Luo,Zhaojin Li,Xue Wan,Wei Zhang,Leizheng Shu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:approaches remain limited, existing approaches remain, absolute localization fails, map-based absolute localization, visual odometry

备注: 8 pages, accepted by RA-L

点击查看摘要

Abstract:Accurate and real-time 6-DoF localization is mission-critical for autonomous lunar landing, yet existing approaches remain limited: visual odometry (VO) drifts unboundedly, while map-based absolute localization fails in texture-sparse or low-light terrain. We introduce KANLoc, a monocular localization framework that tightly couples VO with a lightweight but robust absolute pose regressor. At its core is a Kolmogorov-Arnold Network (KAN) that learns the complex mapping from image features to map coordinates, producing sparse but highly reliable global pose anchors. These anchors are fused into a bundle adjustment framework, effectively canceling drift while retaining local motion precision. KANLoc delivers three key advances: (i) a KAN-based pose regressor that achieves high accuracy with remarkable parameter efficiency, (ii) a hybrid VO-absolute localization scheme that yields globally consistent real-time trajectories (=15 FPS), and (iii) a tailored data augmentation strategy that improves robustness to sensor occlusion. On both realistic synthetic and real lunar landing datasets, KANLoc reduces average translation and rotation error by 32% and 45%, respectively, with per-trajectory gains of up to 45%/48%, outperforming strong baselines.

236. 【2512.22730】Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning

链接https://arxiv.org/abs/2512.22730

作者:Youssef Megahed,Robin Ducharme,Inok Lee,Inbal Willner,Adrian D. C. Chan,Mark Walker,Steven Hawken

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:adverse pregnancy outcomes, portends high rates, high-risk prenatal ultrasound, prenatal ultrasound finding, structural malformations

备注: 13 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).
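
The accuracy, sensitivity, and specificity named in this abstract are all derived from the four cells of a binary confusion matrix. The sketch below shows one way to compute them; the function name `binary_metrics` and the counts are hypothetical values chosen for illustration, not data from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn count positive cases (e.g. cystic hygroma) that were detected/missed;
    tn/fp count negative cases (normal controls) that were cleared/flagged.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    accuracy = (tp + tn) / (positives + negatives)
    sensitivity = tp / positives   # true positive rate
    specificity = tn / negatives   # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts for a single validation fold (illustration only).
acc, sens, spec = binary_metrics(tp=8, fn=2, tn=9, fp=1)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In a k-fold protocol such as the one described, these values would typically be computed per fold and then averaged.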

237. 【2602.08764】Efficient Brain Extraction of MRI Scans with Mild to Moderate Neuropathology

链接https://arxiv.org/abs/2602.08764

作者:Hjalti Thrastarson,Lotta M. Ellingsen

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:stripping magnetic resonance, Skull stripping magnetic, image processing techniques, magnetic resonance images, processing techniques

备注: Accepted for publication in the Proceedings of SPIE Medical Imaging 2026

点击查看摘要

Abstract:Skull stripping magnetic resonance images (MRI) of the human brain is an important process in many image processing techniques, such as automatic segmentation of brain structures. Numerous methods have been developed to perform this task; however, they often fail in the presence of neuropathology and can be inconsistent in defining the boundary of the brain mask. Here, we propose a novel approach to skull strip T1-weighted images in a robust and efficient manner, aiming to consistently segment the outer surface of the brain, including the sulcal cerebrospinal fluid (CSF), while excluding the full extent of the subarachnoid space and meninges. We train a modified version of the U-net on silver-standard ground truth data using a novel loss function based on the signed-distance transform (SDT). We validate our model both qualitatively and quantitatively using held-out data from the training dataset, as well as an independent external dataset. The brain masks used for evaluation partially or fully include the subarachnoid space, which may introduce bias into the comparison; nonetheless, our model demonstrates strong performance on the held-out test data, achieving a consistent mean Dice similarity coefficient (DSC) of 0.964 $\pm$ 0.006 and an average symmetric surface distance (ASSD) of 1.4 $\pm$ 0.2 mm. Performance on the external dataset is comparable, with a DSC of 0.958 $\pm$ 0.006 and an ASSD of 1.7 $\pm$ 0.2 mm. Our method achieves performance comparable to or better than existing state-of-the-art methods for brain extraction, particularly in its highly consistent preservation of the brain's outer surface. The method is publicly available on GitHub.
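
The Dice similarity coefficient reported in this abstract measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch follows, assuming masks are given as flat sequences of 0/1 voxel labels; the function name and the toy masks are hypothetical, not taken from the paper's code.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two equally sized binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Sum of the foreground sizes of the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 6-voxel masks (illustration only): 2 shared voxels, 3 foreground voxels each.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
print(round(dice_coefficient(pred, truth), 4))
```

A 3D volume can be flattened before calling the function, since the coefficient depends only on voxel counts and not on spatial layout.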

238. 【2602.08580】retinalysis-vascx: An explainable software toolbox for the extraction of retinal vascular biomarkers

链接https://arxiv.org/abs/2602.08580

作者:Jose D. Vargas Quiros,Michael J. Beyeler,Sofia Ortin Vela,EyeNED Reading Center,Sven Bergmann,Caroline C.W. Klaver,Bart Liefers

类目:Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV)

关键词:color fundus images, color fundus, automatic extraction, open-source Python toolbox, Python toolbox designed

备注

点击查看摘要

Abstract:The automatic extraction of retinal vascular biomarkers from color fundus images (CFI) is essential for large-scale studies of the retinal vasculature. We present VascX, an open-source Python toolbox designed for the automated extraction of biomarkers from artery and vein segmentations. The VascX workflow processes vessel segmentation masks into skeletons to build undirected and directed vessel graphs, which are then used to resolve segments into continuous vessels. This architecture enables the calculation of a comprehensive suite of biomarkers, including vascular density, bifurcation angles, central retinal equivalents (CREs), tortuosity, and temporal angles, alongside image quality metrics. A distinguishing feature of VascX is its region awareness; by utilizing the fovea, optic disc, and CFI boundaries as anatomical landmarks, the tool ensures spatially standardized measurements and identifies when specific biomarkers are not computable. Spatially localized biomarkers are calculated over grids relative to these landmarks, facilitating precise clinical analysis. Released via GitHub and PyPI, VascX provides an explainable and modifiable framework that supports reproducible vascular research through integrated visualizations. By enabling the rapid extraction of established biomarkers and the development of new ones, VascX advances the field of oculomics, offering a robust, computationally efficient solution for scalable deployment in large-scale clinical and epidemiological databases.

239. 【2602.08249】A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models

链接https://arxiv.org/abs/2602.08249

作者:Weijie Gan,Xucheng Wang,Tongyao Wang,Wenshang Wang,Chunwei Ying,Yuyang Hu,Yasheng Chen,Hongyu An,Ulugbek S. Kamilov

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:complicating training, deployment workflows, handling incomplete multimodal, incomplete multimodal imaging, existing methods require

备注

点击查看摘要

Abstract:Image reconstruction and image synthesis are important for handling incomplete multimodal imaging data, but existing methods require various task-specific models, complicating training and deployment workflows. We introduce Any2all, a unified framework that addresses this limitation by formulating these disparate tasks as a single virtual inpainting problem. We train a single, unconditional diffusion model on the complete multimodal data stack. This model is then adapted at inference time to "inpaint" all target modalities from any combination of inputs of available clean images or noisy measurements. We validated Any2all on a PET/MR/CT brain dataset. Our results show that Any2all can achieve excellent performance on both multimodal reconstruction and synthesis tasks, consistently yielding images with competitive distortion-based performance and superior perceptual quality over specialized methods.

240. 【2602.08029】Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

链接https://arxiv.org/abs/2602.08029

作者:Berthy T. Feng,Andrew A. Chael,David Bromley,Aviad Levis,William T. Freeman,Katherine L. Bouman

类目:General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)

关键词:static black-hole imaging, black hole, black-hole imaging, success of static, static black-hole

备注

点击查看摘要

Abstract:With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PI-DEF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a physics-agnostic approach. We demonstrate how our method may be used to estimate other physics parameters of the black hole, such as its spin.

241. 【2602.07819】DINO-Mix: Distilling Foundational Knowledge with Cross-Domain CutMix for Semi-supervised Class-imbalanced Medical Image Segmentation

链接https://arxiv.org/abs/2602.07819

作者:Xinyu Liu,Guolei Sun

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:mitigating the immense, dense annotations, prevailing SSL frameworks, immense cost, cost of dense

备注: AAAI 2026 Workshop on Artificial Intelligence with Biased or Scarce Data (Oral)

点击查看摘要

Abstract:Semi-supervised learning (SSL) has emerged as a critical paradigm for medical image segmentation, mitigating the immense cost of dense annotations. However, prevailing SSL frameworks are fundamentally "inward-looking", recycling information and biases solely from within the target dataset. This design triggers a vicious cycle of confirmation bias under class imbalance, leading to the catastrophic failure to recognize minority classes. To dismantle this systemic issue, we propose a paradigm shift to a multi-level "outward-looking" framework. Our primary innovation is Foundational Knowledge Distillation (FKD), which looks outward beyond the confines of medical imaging by introducing a pre-trained visual foundation model, DINOv3, as an unbiased external semantic teacher. Instead of trusting the student's biased high confidence, our method distills knowledge from DINOv3's robust understanding of high semantic uniqueness, providing a stable, cross-domain supervisory signal that anchors the learning of minority classes. To complement this core strategy, we further look outward within the data by proposing Progressive Imbalance-aware CutMix (PIC), which creates a dynamic curriculum that adaptively forces the model to focus on minority classes in both labeled and unlabeled subsets. This layered strategy forms our framework, DINO-Mix, which breaks the vicious cycle of bias and achieves remarkable performance on challenging semi-supervised class-imbalanced medical image segmentation benchmarks Synapse and AMOS.

242. 【2602.07570】How does longer temporal context enhance multimodal narrative video processing in the brain?

链接https://arxiv.org/abs/2602.07570

作者:Prachi Jindal,Anant Khandelwal,Manish Gupta,Bapi S. Raju,Subba Reddy Oota,Tanmoy Chakraborty

类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:artificial intelligence systems, intelligence systems process, systems process complex, Understanding how humans, process complex narrative

备注: 22 pages, 15 figures

点击查看摘要

Abstract:Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--12 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.

243. 【2602.07403】Surveillance Facial Image Quality Assessment: A Multi-dimensional Dataset and Lightweight Model

链接https://arxiv.org/abs/2602.07403

作者:Yanwei Jiang,Wei Sun,Yingjie Zhou,Xiangyang Zhu,Yuqin Cao,Jun Jia,Yunhao Li,Sijing Wu,Dandan Zhu,Xingkuo Min,Guangtao Zhai

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Surveillance facial images, severe quality degradation, quality degradation due, image quality assessment, facial image quality

备注

点击查看摘要

Abstract:Surveillance facial images are often captured under unconstrained conditions, resulting in severe quality degradation due to factors such as low resolution, motion blur, occlusion, and poor lighting. Although recent face restoration techniques applied to surveillance cameras can significantly enhance visual quality, they often compromise fidelity (i.e., identity-preserving features), which directly conflicts with the primary objective of surveillance images -- reliable identity verification. Existing facial image quality assessment (FIQA) methods predominantly focus on either visual quality or recognition-oriented evaluation, thereby failing to jointly address visual quality and fidelity, which are critical for surveillance applications. To bridge this gap, we propose the first comprehensive study on surveillance facial image quality assessment (SFIQA), targeting the unique challenges inherent to surveillance scenarios. Specifically, we first construct SFIQA-Bench, a multi-dimensional quality assessment benchmark for surveillance facial images, which consists of 5,004 surveillance facial images captured by three widely deployed surveillance cameras in real-world scenarios. A subjective experiment is conducted to collect six dimensional quality ratings, including noise, sharpness, colorfulness, contrast, fidelity and overall quality, covering the key aspects of SFIQA. Furthermore, we propose SFIQA-Assessor, a lightweight multi-task FIQA model that jointly exploits complementary facial views through cross-view feature interaction, and employs learnable task tokens to guide the unified regression of multiple quality dimensions. The experimental results on the proposed dataset show that our method achieves the best performance compared with the state-of-the-art general image quality assessment (IQA) and FIQA methods, validating its effectiveness for real-world surveillance applications.

244. 【2602.07393】Wavelet-Domain Masked Image Modeling for Color-Consistent HDR Video Reconstruction

链接https://arxiv.org/abs/2602.07393

作者:Yang Zhang,Zhangkai Ni,Wenhan Yang,Hanli Wang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:High Dynamic Range, Low Dynamic Range, Dynamic Range, HDR video reconstruction, High Dynamic

备注

点击查看摘要

Abstract:High Dynamic Range (HDR) video reconstruction aims to recover fine brightness, color, and details from Low Dynamic Range (LDR) videos. However, existing methods often suffer from color inaccuracies and temporal inconsistencies. To address these challenges, we propose WMNet, a novel HDR video reconstruction network that leverages Wavelet domain Masked Image Modeling (W-MIM). WMNet adopts a two-phase training strategy: In Phase I, W-MIM performs self-reconstruction pre-training by selectively masking color and detail information in the wavelet domain, enabling the network to develop robust color restoration capabilities. A curriculum learning scheme further refines the reconstruction process. Phase II fine-tunes the model using the pre-trained weights to improve the final reconstruction quality. To improve temporal consistency, we introduce the Temporal Mixture of Experts (T-MoE) module and the Dynamic Memory Module (DMM). T-MoE adaptively fuses adjacent frames to reduce flickering artifacts, while DMM captures long-range dependencies, ensuring smooth motion and preservation of fine details. Additionally, since existing HDR video datasets lack scene-based segmentation, we reorganize HDRTV4K into HDRTV4K-Scene, establishing a new benchmark for HDR video reconstruction. Extensive experiments demonstrate that WMNet achieves state-of-the-art performance across multiple evaluation metrics, significantly improving color fidelity, temporal coherence, and perceptual quality. The code is available at: this https URL

245. 【2602.07233】Extracting Root-Causal Brain Activity Driving Psychopathology from Resting State fMRI

链接https://arxiv.org/abs/2602.07233

作者:Eric V. Strobl

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

关键词:obscure underlying mechanisms, yielding diffuse associations, correlate imaging patterns, Neuroimaging studies, composite symptom scores

备注

点击查看摘要

Abstract:Neuroimaging studies of psychiatric disorders often correlate imaging patterns with diagnostic labels or composite symptom scores, yielding diffuse associations that obscure underlying mechanisms. We instead seek to identify root-causal maps -- localized BOLD disturbances that initiate pathological cascades -- and to link them selectively to symptom dimensions. We introduce a bilevel structural causal model that connects between-subject symptom structure to within-subject resting-state fMRI via independent latent sources with localized direct effects. Based on this model, we develop SOURCE (Symptom-Oriented Uncovering of Root-Causal Elements), a procedure that links interpretable symptom axes to a parsimonious set of localized drivers. Experiments show that SOURCE recovers localized maps consistent with root-causal BOLD drivers and increases interpretability and anatomical specificity relative to existing comparators.

246. 【2602.07094】Exploring Polarimetric Properties Preservation during Reconstruction of PolSAR images using Complex-valued Convolutional Neural Networks

链接https://arxiv.org/abs/2602.07094

作者:Quentin Gabot,Joana Frontera-Pons,Jérémy Fix,Chengfang Ren,Jean-Philippe Ovarlez

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:specialized algorithms capable, Polarimetric SAR data, inherently complex-valued nature, SAR data necessitates, Polarimetric SAR

备注: Accepted with minor revisions at IET Radar, Sonar & Navigation

点击查看摘要

Abstract:The inherently complex-valued nature of Polarimetric SAR data necessitates using specialized algorithms capable of directly processing complex-valued representations. However, this aspect remains underexplored in the deep learning community, with many studies opting to convert complex signals into the real domain before applying conventional real-valued models. In this work, we leverage complex-valued neural networks and investigate the performance of complex-valued Convolutional AutoEncoders. We show that these networks can effectively compress and reconstruct fully polarimetric SAR data while preserving essential physical characteristics, as demonstrated through Pauli, Krogager, and Cameron coherent decompositions, as well as the non-coherent $H-\alpha$ decomposition. Finally, we highlight the advantages of complex-valued neural networks over their real-valued counterparts. These insights pave the way for developing robust, physics-informed, complex-valued generative models for SAR data processing.

247. 【2602.07068】MRI Cross-Modal Synthesis: A Comparative Study of Generative Models for T1-to-T2 Reconstruction

链接https://arxiv.org/abs/2602.07068

作者:Ali Alqutayfi,Sadam Al-Azani

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:offering considerable clinical, maintaining diagnostic information, reducing scan time, involves generating images, cross-modal synthesis involves

备注

点击查看摘要

Abstract:MRI cross-modal synthesis involves generating images from one acquisition protocol using another, offering considerable clinical value by reducing scan time while maintaining diagnostic information. This paper presents a comprehensive comparison of three state-of-the-art generative models for T1-to-T2 MRI reconstruction: Pix2Pix GAN, CycleGAN, and Variational Autoencoder (VAE). Using the BraTS 2020 dataset (11,439 training and 2,000 testing slices), we evaluate these models based on established metrics including Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM). Our experiments demonstrate that all models can successfully synthesize T2 images from T1 inputs, with CycleGAN achieving the highest PSNR (32.28 dB) and SSIM (0.9008), while Pix2Pix GAN provides the lowest MSE (0.005846). The VAE, though showing lower quantitative performance (MSE: 0.006949, PSNR: 24.95 dB, SSIM: 0.6573), offers advantages in latent space representation and sampling capabilities. This comparative study provides valuable insights for researchers and clinicians selecting appropriate generative models for MRI synthesis applications based on their specific requirements and data constraints.
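
Two of the metrics used in this comparison, MSE and PSNR, are directly related: PSNR is the log-scaled ratio of the squared peak intensity to the MSE. The sketch below assumes intensities normalized to [0, 1] and flat pixel sequences; the function names and the four-pixel example are hypothetical, not from the paper.

```python
import math


def mse(reference, synthesized):
    """Mean squared error between two equally sized intensity sequences."""
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)


def psnr(reference, synthesized, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed peak intensity."""
    err = mse(reference, synthesized)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy "images" of four pixels each (illustration only).
reference = [0.0, 0.5, 1.0, 0.5]
synthesized = [0.1, 0.5, 0.9, 0.5]
print(f"MSE={mse(reference, synthesized):.4f} PSNR={psnr(reference, synthesized):.2f} dB")
```

Dataset-level PSNR is usually averaged per image, so it generally cannot be recovered from a dataset-level mean MSE alone.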

248. 【2602.07060】U-Net Based Image Enhancement for Short-time Muon Scattering Tomography

链接https://arxiv.org/abs/2602.07060

作者:Haochen Wang,Pei Yu,Liangwen Chen,Weibo He,Yu Zhang,Yuhong Yu,Xueheng Zhang,Lei Yang,Zhiyu Sun

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det); Medical Physics (physics.med-ph)

关键词:Muon Scattering Tomography, limited muon flux, Scattering Tomography, non-invasive inspection technique, promising non-invasive inspection

备注

点击查看摘要

Abstract:Muon Scattering Tomography (MST) is a promising non-invasive inspection technique, yet the practical application of short-time MST is hindered by poor image quality due to limited muon flux. To address this limitation, we propose a U-Net-based framework trained on Point of Closest Approach (PoCA) images reconstructed with simulated MST data to enhance image quality. When applied to experimental MST data, the framework significantly improves image quality, increasing the Structural Similarity Index Measure (SSIM) from 0.7232 to 0.9699 and decreasing the Learned Perceptual Image Patch Similarity (LPIPS) from 0.3604 to 0.0270. These results demonstrate that our method can effectively enhance low-statistics MST images, thereby paving the way for the practical deployment of short-time MST.

249. 【2602.07056】MTS-CSNet: Multiscale Tensor Factorization for Deep Compressive Sensing on RGB Images

Link: https://arxiv.org/abs/2602.07056

Authors: Mehmet Yamac, Lei Xu, Serkan Kiranyaz, Moncef Gabbouj

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning based, fully connected layers, high dimensional data, typically learn sampling, block wise fully

Comments: 6 pages, 5 figures

Abstract: Deep-learning-based compressive sensing (CS) methods typically learn sampling operators using convolutional or block-wise fully connected layers, which limit receptive fields and scale poorly to high-dimensional data. We propose MTS-CSNet, a CS framework based on Multiscale Tensor Summation (MTS) factorization, a structured operator for efficient multidimensional signal processing. MTS performs mode-wise linear transformations with multiscale summation, enabling large receptive fields and effective modeling of cross-dimensional correlations. In MTS-CSNet, MTS is first used as a learnable CS operator that performs linear dimensionality reduction in tensor space, with its adjoint defining the initial back-projection, and is then applied in the reconstruction stage to directly refine this estimate. This results in a simple feed-forward architecture without iterative or proximal optimization that remains parameter- and computation-efficient. Experiments on standard CS benchmarks show that MTS-CSNet achieves state-of-the-art reconstruction performance on RGB images, with notable PSNR gains and faster inference, even compared to recent diffusion-based CS methods, while using a significantly more compact feed-forward architecture.
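
The idea of a mode-wise linear transformation with an adjoint back-projection can be sketched with plain NumPy. This is not the authors' MTS operator: the multiscale summation is omitted, the matrices are random rather than learned, and the names `modewise`, `modewise_adjoint`, and all shapes are assumptions chosen for illustration.

```python
import numpy as np

def modewise(x, mats):
    """Apply one matrix per tensor mode: y = x x_1 A_h x_2 A_w x_3 A_c."""
    a_h, a_w, a_c = mats
    return np.einsum("ai,bj,ck,ijk->abc", a_h, a_w, a_c, x)

def modewise_adjoint(y, mats):
    """Adjoint of modewise: applies the transposed matrices (back-projection)."""
    a_h, a_w, a_c = mats
    return np.einsum("ai,bj,ck,abc->ijk", a_h, a_w, a_c, y)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))                 # hypothetical small RGB patch
mats = (rng.standard_normal((4, 8)),               # height: 8 -> 4
        rng.standard_normal((4, 8)),               # width:  8 -> 4
        rng.standard_normal((2, 3)))               # channel: 3 -> 2

y = modewise(x, mats)                              # compressed measurements
x0 = modewise_adjoint(y, mats)                     # initial back-projection
factored = sum(m.size for m in mats)
dense = y.size * x.size                            # one unstructured matrix
print(y.shape, x0.shape, factored, dense)
```

In this toy setting the three factor matrices hold 70 parameters, while a single dense sampling matrix between the same input and output sizes would hold 6144, which illustrates why structured operators of this kind can be parameter-efficient.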

250. 【2602.07029】Guidestar-Free Adaptive Optics with Asymmetric Apertures

Link: https://arxiv.org/abs/2602.07029

Authors: Weiyun Jiang, Haiyun Guo, Christopher A. Metzler, Ashok Veeraraghavan

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: closed-loop adaptive optics, optically correcting aberrations, adaptive optics, system capable, closed-loop adaptive

Abstract: This work introduces the first closed-loop adaptive optics (AO) system capable of optically correcting aberrations in real time without a guidestar or a wavefront sensor. Nearly 40 years ago, Cederquist et al. demonstrated that asymmetric apertures enable phase retrieval (PR) algorithms to perform fully computational wavefront sensing, albeit at a high computational cost. More recently, Chimitt et al. extended this approach with machine learning and demonstrated real-time wavefront sensing using only a single (guidestar-based) point-spread-function (PSF) measurement. Inspired by these works, we introduce a guidestar-free AO framework built around asymmetric apertures and machine learning. Our approach combines three key elements: (1) an asymmetric aperture placed in the optical path that enables PR-based wavefront sensing, (2) a pair of machine learning algorithms that estimate the PSF from natural scene measurements and reconstruct phase aberrations, and (3) a spatial light modulator that performs optical correction. We experimentally validate this framework on dense natural scenes imaged through unknown obscurants. Our method outperforms state-of-the-art guidestar-free wavefront shaping methods, using an order of magnitude fewer measurements and three orders of magnitude less computation.

251. 【2602.07022】Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss

Link: https://arxiv.org/abs/2602.07022

Authors: Yucheng Zhou, Hao Li, Jianbing Shen

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Recent studies, optimize image generation, autoregressive models, combined diffusion models, explored autoregressive models

Comments: ICLR 2026

Abstract: Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion models and autoregressive models with diffusion loss, highlighting the latter's advantages. We present a theoretical comparison of conditional diffusion and autoregressive diffusion with diffusion loss, demonstrating that patch denoising optimization in autoregressive models effectively mitigates condition errors and leads to a stable condition distribution. Our analysis also reveals that autoregressive condition generation refines the condition, causing the condition error influence to decay exponentially. In addition, we introduce a novel condition refinement approach based on Optimal Transport (OT) theory to address "condition inconsistency". We theoretically demonstrate that formulating condition refinement as a Wasserstein Gradient Flow ensures convergence toward the ideal condition distribution, effectively mitigating condition inconsistency. Experiments demonstrate the superiority of our method over both diffusion models and autoregressive models with diffusion loss.

252. 【2602.06994】SurfAge-Net: A Hierarchical Surface-Based Network for Interpretable Fine-Grained Brain Age Prediction

Link: https://arxiv.org/abs/2602.06994

Authors: Rongzhao He, Dalin Zhu, Ying Wang, Songhong Yue, Leilei Zhao, Yu Fu, Dan Wu, Bin Hu, Weihao Zheng

Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Brain age prediction, assessing brain status, age prediction serves, age prediction, Brain age

Abstract: Brain age prediction serves as a powerful framework for assessing brain status and detecting deviations associated with neurodevelopmental and neurodegenerative disorders. However, most existing approaches emphasize whole-brain age prediction and therefore overlook the pronounced regional heterogeneity of brain maturation that is crucial for detecting localized atypical trajectories. To address this limitation, we propose a novel spherical surface-based brain age prediction network (SurfAge-Net) that leverages multiple morphological metrics to capture region-specific developmental patterns with enhanced robustness and clinical interpretability. SurfAge-Net establishes a new modeling paradigm by incorporating the connectomic principles of cortical organization: it explicitly models both intra- and inter-hemispheric dependencies through a spatial-channel mixing and a lateralization-aware attention mechanism, enabling the network to characterize the coordinated maturation pattern uniquely associated with each target region. Validated on three fetal and neonatal datasets, SurfAge-Net outperforms existing approaches (global MAE = 0.54, regional MAE = 0.45 in gestational/postmenstrual weeks) and demonstrates strong generalizability across external cohorts. Importantly, it provides spatially precise and biologically interpretable maps of cortical maturation, effectively identifying heterogeneous delays and region-specific abnormalities in atypical developmental populations. These results establish fine-grained brain age prediction as a promising paradigm for advancing neurodevelopmental research and supporting early clinical assessment.