本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新1680篇论文,其中:

  • 自然语言处理245
  • 信息检索45
  • 计算机视觉343

自然语言处理

1. 【2602.02495】Reward-free Alignment for Conflicting Objectives

链接https://arxiv.org/abs/2602.02495

作者:Peter Chen,Xiaopeng Li,Xi Chen,Tianyi Lin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:align large language, large language models, Direct alignment methods, Direct alignment, align large

备注: 27 pages

点击查看摘要

Abstract:Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

2. 【2602.02488】RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

链接https://arxiv.org/abs/2602.02488

作者:Yinjie Wang,Tianbao Xie,Ke Shen,Mengdi Wang,Ling Yang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:reinforcement learning framework, dynamically forges environment, amplifying learning signals, closed-loop optimization, framework that dynamically

备注: Code: [this https URL](https://github.com/Gen-Verse/Open-AgentRL)

点击查看摘要

Abstract:We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: this https URL

3. 【2602.02486】RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

链接https://arxiv.org/abs/2602.02486

作者:Jialiang Zhu,Gongrui Zhang,Xiaolong Ma,Lin Xu,Miaosen Zhang,Ruiqi Yang,Song Wang,Kai Qiu,Zhirong Wu,Qi Dai,Ruichun Ma,Bei Liu,Yifan Yang,Chong Luo,Zhengyuan Yang,Linjie Li,Lijuan Wang,Weizhu Chen,Xin Geng,Baining Guo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:LLM-based deep research, deep research agents, LLM-based deep, agents are largely, largely built

备注

点击查看摘要

Abstract:LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.

4. 【2602.02477】raining LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

链接https://arxiv.org/abs/2602.02477

作者:Xiao Liang,Zhong-Zhi Li,Zhenghao Lin,Eric Hancheng Jiang,Hengyuan Zhang,Yelong Shen,Kai-Wei Chang,Ying Nian Wu,Yeyun Gong,Weizhu Chen

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, demonstrated strong reasoning, demonstrated strong, strong reasoning capabilities

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.

5. 【2602.02474】MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

链接https://arxiv.org/abs/2602.02474

作者:Haozhen Zhang,Quanyu Long,Jianzhu Bao,Tao Feng,Weizhi Zhang,Haodong Yue,Wenya Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Model, Language Model, Large Language, memory systems rely, systems rely

备注: Code is available at [this https URL](https://github.com/ViktorAxelsen/MemSkill)

点击查看摘要

Abstract:Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present \textbf{MemSkill}, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emph{controller} that learns to select a small set of relevant skills, paired with an LLM-based \emph{executor} that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a \emph{designer} that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

6. 【2602.02472】SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning

链接https://arxiv.org/abs/2602.02472

作者:Qifan Yu,Xinyu Ma,Zhijian Zhuo,Minrui Wang,Deyi Liu,Shiyi Zhan,Yiyuan Ma,Liang Xiang,Xingyan Bin,Di He

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:pre-training computational overhead, Progressive Learning, increasing model scale, gradually increasing model, overhead by gradually

备注

点击查看摘要

Abstract:Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, expanding width during the mid-stage is essential for maximizing computational savings, yet it remains a formidable challenge due to severe training instabilities. Empirically, we show that naive initialization at this stage disrupts activation statistics, triggering loss spikes, while copy-based initialization introduces gradient symmetry that hinders feature diversity. To address these issues, we propose SPARKLING (balancing {S}ignal {P}reservation {A}nd symmet{R}y brea{K}ing for width-progressive {L}earn{ING}), a novel framework for mid-stage width expansion. Our method achieves signal preservation via RMS-scale consistency, stabilizing activation statistics during expansion. Symmetry breaking is ensured through asymmetric optimizer state resetting and learning rate re-warmup. Extensive experiments on Mixture-of-Experts (MoE) models demonstrate that, across multiple width axes and optimizer families, SPARKLING consistently outperforms training from scratch and reduces training cost by up to 35% under $2\times$ width expansion.

7. 【2602.02468】Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts

链接https://arxiv.org/abs/2602.02468

作者:Aiden Yiliu Li,Xinyue Hao,Shilong Liu,Mengdi Wang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:multimodal large language, reliably execute long-horizon, large language models, Document Object Model, execute long-horizon tasks

备注

点击查看摘要

Abstract:Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.

8. 【2602.02467】Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models

链接https://arxiv.org/abs/2602.02467

作者:Noam Steinmetz Yalon,Ariel Goldstein,Liad Mudrik,Mor Geva

类目:Computation and Language (cs.CL)

关键词:Rapid advancements, large language models, advancements in large, large language, sparked the question

备注

点击查看摘要

Abstract:Rapid advancements in large language models (LLMs) have sparked the question whether these models possess some form of consciousness. To tackle this challenge, Butlin et al. (2023) introduced a list of indicators for consciousness in artificial systems based on neuroscientific theories. In this work, we evaluate a key indicator from this list, called HOT-3, which tests for agency guided by a general belief-formation and action selection system that updates beliefs based on meta-cognitive monitoring. We view beliefs as representations in the model's latent space that emerge in response to a given input, and introduce a metric to quantify their dominance during generation. Analyzing the dynamics between competing beliefs across models and tasks reveals three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives the model's action selection, and (3) models can monitor and report their own belief states. Together, these results provide empirical support for the existence of belief-guided agency and meta-cognitive monitoring in LLMs. More broadly, our work lays methodological groundwork for investigating the emergence of agency, beliefs, and meta-cognition in LLMs.

9. 【2602.02464】From Directions to Regions: Decomposing Activations in Language Models via Local Geometry

链接https://arxiv.org/abs/2602.02464

作者:Or Shafran,Shaked Ronen,Omri Fahn,Shauli Ravfogel,Atticus Geiger,Mor Geva

类目:Computation and Language (cs.CL)

关键词:activation space, tightly coupled, Activation decomposition methods, Activation, Factor Analyzers

备注

点击查看摘要

Abstract:Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.

10. 【2602.02462】Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models

链接https://arxiv.org/abs/2602.02462

作者:Gabriele Maraia,Marco Valentino,Fabio Massimo Zanzotto,Leonardo Ranaldi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, systematically conflating semantic, conflating semantic plausibility, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity a phenomenon known as content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model's internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model's activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.

11. 【2602.02455】Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction

链接https://arxiv.org/abs/2602.02455

作者:Han Bao,Zheyuan Zhang,Pengcheng Jing,Zhengqing Yuan,Kaiwen Shi,Yanfang Ye

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:Large Language Models, Language Models transition, Large Language, Language Models, Models transition

备注: 65 pages, 40 figures

点击查看摘要

Abstract:As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce \textbf{Drift-Bench}, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, \textbf{Drift-Bench} provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the \textbf{Rise} evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. \MethodName bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.

12. 【2602.02440】Large Language Models for Mental Health: A Multilingual Evaluation

链接https://arxiv.org/abs/2602.02440

作者:Nishat Raihan,Sadiya Sayara Chowdhury Puspo,Ana-Maria Bucur,Stevie Chancellor,Marcos Zampieri

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, remarkable capabilities, mental health

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.

13. 【2602.02414】Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank

链接https://arxiv.org/abs/2602.02414

作者:Joshua Mitton,Prarthana Bhattacharyya,Digory Smith,Thomas Christie,Ralph Abboud,Simon Woodhead

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:improving learning outcomes, Timely and accurate, student errors, identification of student, compounding of student

备注: 21 pages, 8 figures, 8 tables. Joshua Mitton and Prarthana Bhattacharyya contributed equally to this paper

点击查看摘要

Abstract:Timely and accurate identification of student misconceptions is key to improving learning outcomes and pre-empting the compounding of student errors. However, this task is highly dependent on the effort and intuition of the teacher. In this work, we present a novel approach for detecting misconceptions from student-tutor dialogues using large language models (LLMs). First, we use a fine-tuned LLM to generate plausible misconceptions, and then retrieve the most promising candidates among these using embedding similarity with the input dialogue. These candidates are then assessed and re-ranked by another fine-tuned LLM to improve misconception relevance. Empirically, we evaluate our system on real dialogues from an educational tutoring platform. We consider multiple base LLM models including LLaMA, Qwen and Claude on zero-shot and fine-tuned settings. We find that our approach improves predictive performance over baseline models and that fine-tuning improves both generated misconception quality and can outperform larger closed-source models. Finally, we conduct ablation studies to both validate the importance of our generation and reranking steps on misconception generation quality.

14. 【2602.02382】ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs

链接https://arxiv.org/abs/2602.02382

作者:Ziyan Zhang,Chao Wang,Zhuo Chen,Chiyi Li,Kai Song

类目:Computation and Language (cs.CL)

关键词:Answering first-order logic, incomplete knowledge graphs, Answering first-order, first-order logic, knowledge graphs

备注

点击查看摘要

Abstract:Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framework that combines query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning. ROG decomposes a multi-operator query into a sequence of single-operator sub-queries and grounds each step in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains. This design reduces compounding errors and yields more robust inference on complex and negation-heavy queries. Overall, ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference. Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.

15. 【2602.02378】From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making

链接https://arxiv.org/abs/2602.02378

作者:Raunak Jain,Mudita Khurana,John Stephens,Srinivas Dharmasanam,Shankar Venkataraman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:dangerous pattern emerges, pattern emerges, calibrated judgment, LLMs expand, expand from assistance

备注

点击查看摘要

Abstract:As LLMs expand from assistance to decision support, a dangerous pattern emerges: fluent agreement without calibrated judgment. Low-friction assistants can become sycophantic, baking in implicit assumptions and pushing verification costs onto experts, while outcomes arrive too late to serve as reward signals. In deep-uncertainty decisions (where objectives are contested and reversals are costly), scaling fluent agreement amplifies poor commitments faster than it builds expertise. We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical. A discrepancy-driven control loop operates over this substrate: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Commitment gating blocks action on uncommitted load-bearing premises unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Trust then attaches to auditable premises and evidence standards, not conversational fluency. We illustrate with tutoring and propose falsifiable evaluation criteria.

16. 【2602.02377】Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

链接https://arxiv.org/abs/2602.02377

作者:Haotong Yang,Zitong Wang,Shijia Kang,Siqi Yang,Wenkai Yu,Xu Niu,Yike Sun,Yi Hu,Zhouchen Lin,Muhan Zhang

类目:Computation and Language (cs.CL)

关键词:simple answer matching, Reinforcement Learning, math reasoning abilities, Large Language Models, abilities through Reinforcement

备注: Under review

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.

17. 【2602.02360】Automated Multiple Mini Interview (MMI) Scoring

链接https://arxiv.org/abs/2602.02360

作者:Ryan Huynh,Frank Guerin,Alison Callwood

类目:Computation and Language (cs.CL)

关键词:Assessing soft skills, competitive selection processes, Assessing soft, ethical judgment, selection processes

备注: 18 pages, 2 figures

点击查看摘要

Abstract:Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.

18. 【2602.02343】Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

链接https://arxiv.org/abs/2602.02343

作者:Ziwen Xu,Chenyan Wu,Hengyu Sun,Haiwen Hong,Mengru Wang,Yunzhi Yao,Longtao Huang,Hui Xue,Shumin Deng,Zhixuan Chu,Huajun Chen,Ningyu Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:making comparison difficult, controlling large language, including local weight, local weight fine-tuning, large language models

备注: Work in progress

点击查看摘要

Abstract:Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at this https URL.

19. 【2602.02326】Language Steering for Multilingual In-Context Learning

链接https://arxiv.org/abs/2602.02326

作者:Neeraja Kirtane,Kuan-Hao Huang

类目:Computation and Language (cs.CL)

关键词:gained widespread adoption, remains substantially inferior, languages remains substantially, widespread adoption, non-English languages remains

备注

点击查看摘要

Abstract:While multilingual large language models have gained widespread adoption, their performance on non-English languages remains substantially inferior to English. This disparity is particularly evident in in-context learning scenarios, where providing demonstrations in English but testing on non-English inputs leads to significant performance degradation. In this paper, we hypothesize that LLMs develop a universal semantic space for understanding languages, where different languages are encoded as distinct directions within this space. Based on this hypothesis, we propose language vectors -- a training-free language steering approach that leverages activation differences between source and target languages to guide model behavior. We steer the model generations by adding the vector to the intermediate model activations during inference. This is done to make the model's internal representations shift towards the target language space without any parameter updates. We evaluate our method across three datasets and test on a total of 19 languages on three different models. Our results show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested. Beyond performance gains, hierarchical clustering of steering vectors reveals meaningful linguistic structure aligned with language families. These vectors also successfully transfer across tasks, demonstrating that these representations are task-agnostic.

20. 【2602.02320】A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

链接https://arxiv.org/abs/2602.02320

作者:Feiyang Cai,Guijuan He,Yi Hu,Jingjing Wang,Joshua Luo,Tianyu Zhu,Srikanth Pilla,Gang Li,Ling Liu,Feng Luo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)

关键词:function is largely, largely determined, molecular structure, Molecular function, Molecular

备注

点击查看摘要

Abstract:Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6\%$. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.

21. 【2602.02315】he Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors

链接https://arxiv.org/abs/2602.02315

作者:Raphaël Sarfati,Eric Bigelow,Daniel Wurgaft,Jack Merullo,Atticus Geiger,Owen Lewis,Tom McGrath,Ekdeep Singh Lubana

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, represent prompt-conditioned beliefs, represent prompt-conditioned, posteriors over answers

备注

点击查看摘要

Abstract:Large language models (LLMs) represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. We study a controlled setting in which Llama-3.2 generates samples from a normal distribution by implicitly inferring its parameters (mean and standard deviation) given only samples from the distribution in context. We find representations of curved "belief manifolds" for these parameters form with sufficient in-context learning and study how the model adapts when the distribution suddenly changes. While standard linear steering often pushes the model off-manifold and induces coupled, out-of-distribution shifts, geometry and field-aware steering better preserves the intended belief family. Our work demonstrates an example of linear field probing (LFP) as a simple approach to tile the data manifold and make interventions that respect the underlying geometry. We conclude that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.

22. 【2602.02313】Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

链接https://arxiv.org/abs/2602.02313

作者:Changming Li,Kaixing Zhang,Haoyun Xu,Yingdong Shi,Zheng Zhang,Kaitao Song,Kan Ren

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, complex real-world problems, Large language, solving complex real-world, strong reasoning abilities

备注

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or capture sequential influence from model internal workings to the reasoning outputs. In this paper, built on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that have sequential contribution to reasoning behavior where outcomes are cumulated by long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to model's inner components by propagating compound outcome-based signals such as post reasoning accuracy backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.

23. 【2602.02301】Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

链接https://arxiv.org/abs/2602.02301

作者:Min Cai,Yu Liang,Longzheng Wang,Yan Wang,Yueyang Zhang,Long Xia,Zhiyuan Sun,Xi Ye,Daiting Shi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Reinforcement learning, yielding strong gains, large reasoning models, yielding strong, large reasoning

备注: Preprint; Code: [this https URL](https://github.com/StringNLPLAB/MGS;) Website: [this https URL](https://modular-gradient-surgery.github.io)

点击查看摘要

Abstract:Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce **M**odular **G**radient **S**urgery (**MGS**), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6\%) and 4.5 (11.1\%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.

24. 【2602.02290】Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

链接https://arxiv.org/abs/2602.02290

作者:Alex Argese,Pasquale Lisena,Raphaël Troncy

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:stories remains challenging, diverse audiences, remains challenging, turn scientific articles, Generative

备注

点击查看摘要

Abstract:Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity-qualities that are not often well-captured by standard summarization metrics. Meanwhile, factual hallucinations are critical in scientific contexts, yet, detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how it is narrated and controlled.

25. 【2602.02287】Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

链接https://arxiv.org/abs/2602.02287

作者:Isaac Chung,Linda Freienthal

类目:Computation and Language (cs.CL)

关键词:genuine model performance, Cross-lingual evaluation, typically conflates, sources of variance, measurement instability

备注: First Workshop on Multilingual Multicultural Evaluation, co-located with EACL 2026

点击查看摘要

Abstract:Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at this https URL.

Comments:
First Workshop on Multilingual Multicultural Evaluation, co-located with EACL 2026

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.02287 [cs.CL]

(or
arXiv:2602.02287v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.02287

Focus to learn more

              arXiv-issued DOI via DataCite</p>
26. 【2602.02285】Statistical Learning Theory in Lean 4: Empirical Processes from Scratch

链接https://arxiv.org/abs/2602.02285

作者:Yuanhe Zhang,Jason D. Lee,Fanghui Liu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST)

关键词:Gaussian Lipschitz concentration, statistical learning theory, empirical process theory, comprehensive Lean, grounded in empirical

备注: 19 pages, 2 figures. Comments are welcome

点击查看摘要

Abstract:We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at this https URL

27. 【2602.02280】RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

链接https://arxiv.org/abs/2602.02280

作者:Zeming Wei,Zhixin Zhang,Chengcan Wu,Yihao Zhang,Xiaokun Luan,Meng Sun

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:Recent advancements, led to significant, significant breakthroughs, RACA, Recent

备注

点击查看摘要

Abstract:Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization, where the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and is superior to traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling. Furthermore, our findings confirm RACA's generalization to various scenarios and its robustness across various configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of testing for AI.

28. 【2602.02276】Kimi K2.5: Visual Agentic Intelligence

链接https://arxiv.org/abs/2602.02276

作者:Kimi Team:Tongtong Bai,Yifan Bai,Yiping Bao,S.H. Cai,Yuan Cao,Y. Charles,H.S. Che,Cheng Chen,Guanduo Chen,Huarong Chen,Jia Chen,Jiahao Chen,Jianlong Chen,Jun Chen,Kefan Chen,Liang Chen,Ruijue Chen,Xinhao Chen,Yanru Chen,Yanxu Chen,Yicun Chen,Yimin Chen,Yingjiang Chen,Yuankun Chen,Yujie Chen,Yutian Chen,Zhirong Chen,Ziwei Chen,Dazhi Cheng,Minghan Chu,Jialei Cui,Jiaqi Deng,Muxi Diao,Hao Ding,Mengfan Dong,Mengnan Dong,Yuxin Dong,Yuhao Dong,Angang Du,Chenzhuang Du,Dikang Du,Lingxiao Du,Yulun Du,Yu Fan,Shengjun Fang,Qiulin Feng,Yichen Feng,Garimugai Fu,Kelin Fu,Hongcheng Gao,Tong Gao,Yuyao Ge,Shangyi Geng,Chengyang Gong,Xiaochen Gong,Zhuoma Gongque,Qizheng Gu,Xinran Gu,Yicheng Gu,Longyu Guan,Yuanying Guo,Xiaoru Hao,Weiran He,Wenyang He,Yunjia He,Chao Hong,Hao Hu,Jiaxi Hu,Yangyang Hu,Zhenxing Hu,Ke Huang,Ruiyuan Huang,Weixiao Huang,Zhiqi Huang,Tao Jiang,Zhejun Jiang,Xinyi Jin,Yu Jing,Guokun Lai,Aidi Li,C. Li,Cheng Li,Fang Li,Guanghe Li,Guanyu Li,Haitao Li,Haoyang Li,Jia Li,Jingwei Li,Junxiong Li,Lincan Li,Mo Li,Weihong Li,Wentao Li,Xinhang Li,Xinhao Li,Yang Li,Yanhao Li,Yiwei Li

类目:Computation and Language (cs.CL)

关键词:advance general agentic, designed to advance, advance general, general agentic intelligence, Kimi

备注: Kimi K2.5 tech report

点击查看摘要

Abstract:We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

29. 【2602.02270】dziribot: rag based intelligent conversational agent for algerian arabic dialect

链接https://arxiv.org/abs/2602.02270

作者:El Batoul Bechiri,Dihia Lanasri

类目:Computation and Language (cs.CL)

关键词:conversational agents capable, rapid digitalization, digitalization of customer, intensified the demand, capable of providing

备注

点击查看摘要

Abstract:The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. These results significantly outperform traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.

30. 【2602.02266】OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data

链接https://arxiv.org/abs/2602.02266

作者:Tan Sang Nguyen,Muhammad Reza Qorib,Hwee Tou Ng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:natural language processing, Large language models, Large language, wide range, range of natural

备注

点击查看摘要

Abstract:Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.

31. 【2602.02262】OmniCode: A Benchmark for Evaluating Software Engineering Agents

链接https://arxiv.org/abs/2602.02262

作者:Atharv Sonwane,Eng-Shen Tu,Wei-Chung Lu,Claas Beger,Carter Larsen,Debjit Dhar,Rachel Chen,Ronit Pattanayak,Tuan Anh Dang,Guohao Chen,Gloria Geng,Kevin Ellis,Saikat Dutta

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:test generation, LLM-powered coding agents, software, tasks, Java Test Generation

备注

点击查看摘要

Abstract:LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 20.9% with DeepSeek-V3.1 on Java Test Generation tasks. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at this https URL.

32. 【2602.02244】Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

链接https://arxiv.org/abs/2602.02244

作者:Hao Wang,Hao Gu,Hongming Piao,Kaixiong Gong,Yuxiao Ye,Xiangyu Yue,Sirui Han,Yike Guo,Dapeng Wu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:imitates expert demonstrations, reduces generation diversity, standard post-training recipe, narrowed solution space, SFT imitates expert

备注

点击查看摘要

Abstract:The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in SFT stage, CurioSFT outperforms the vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in RL stage, yielding an average improvement of 5.0 points.

33. 【2602.02221】Using Correspondence Patterns to Identify Irregular Words in Cognate sets Through Leave-One-Out Validation

链接https://arxiv.org/abs/2602.02221

作者:Frederic Blum,Johann-Mattis List

类目:Computation and Language (cs.CL)

关键词:Regular sound correspondences, Regular sound, sound correspondences constitute, constitute the principal, principal evidence

备注: Accepted for the L'Change workshop @ EACL 2026

点击查看摘要

Abstract:Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85\% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.

34. 【2602.02219】Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

链接https://arxiv.org/abs/2602.02219

作者:Yuzheng Xu,Tosho Hirasawa,Tadashi Kozuno,Yoshitaka Ushiku

类目:Computation and Language (cs.CL)

关键词:Large language models, field commonly referred, Large language, quality of text, evaluate the quality

备注

点击查看摘要

Abstract:Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior works mainly focus on point-wise and pair-wise evaluation paradigms. Rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves correlation between the LLM-as-a-Judge and human. Our results suggest that rubric-based LLM-as-a-Judge is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.

35. 【2602.02208】owards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study

链接https://arxiv.org/abs/2602.02208

作者:Md. Toufique Hasan,Ayman Asad Khan,Mika Saari,Vaishnavi Bankhele,Pekka Abrahamsson

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)

关键词:English-centric training data, limited real-world evaluation, English-centric training, Large language models, Large language

备注: 6 pages, 2 figures, submitted to MIPRO 2026

点击查看摘要

Abstract:Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.

36. 【2602.02207】Sinhala Physical Common Sense Reasoning Dataset for Global PIQA

链接https://arxiv.org/abs/2602.02207

作者:Nisansa de Silva,Surangika Ranathunga

类目:Computation and Language (cs.CL)

关键词:Global PIQA, physical common sense, common sense reasoning, sense reasoning dataset, reasoning dataset created

备注

点击查看摘要

Abstract:This paper presents the first-ever Sinhala physical common sense reasoning dataset created as part of Global PIQA. It contains 110 human-created and verified data samples, where each sample consists of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.

37. 【2602.02199】More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression

链接https://arxiv.org/abs/2602.02199

作者:Aryan Sood,Tanvi Sharma,Vansh Agrawal

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, theoretically support extensive, support extensive context

备注

点击查看摘要

Abstract:While Large Language Models (LLMs) can theoretically support extensive context windows, their actual deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet trade-off semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard fixed summary size approach by implementing a block-wise accumulation strategy governed by a protection divisor (n). This allows us to isolate the effects of compression from sliding window artifacts. Our experiments on the Babilong benchmark reveal performance degradation in previous compression methods by 15-30% on various long context tasks. LASER-KV maintains stable performance, achieving superior accuracies by a margin of upto 10% at 128k. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.

38. 【2602.02185】Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

链接https://arxiv.org/abs/2602.02185

作者:Yu Zeng,Wenxuan Huang,Zhen Fang,Shuang Chen,Yufan Shen,Yishuo Cai,Xiaoman Wang,Zhenfei Yin,Lin Chen,Zehui Chen,Shiting Huang,Yiming Zhao,Yao Hu,Philip Torr,Wanli Ouyang,Shaosheng Cao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Multimodal Large Language, Large Language, complex visual-textual fact-finding, Language Models

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, overly idealized evaluation scenario: On the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released in this https URL.

39. 【2602.02182】Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

链接https://arxiv.org/abs/2602.02182

作者:Tjaša Arčon(1),Matej Klemen(1),Marko Robnik-Šikonja(1),Kaja Dobrovoljc(1, 2, 3) ((1) University of Ljubljana, Faculty of Computer and Information Science, Slovenia (2) University of Ljubljana, Faculty of Arts, Slovenia, (3) Jožef Stefan Institute, Ljubljana, Slovenia)

类目:Computation and Language (cs.CL)

关键词:remains poorly understood, structure remains poorly, Large language models, linguistic structure remains, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge-explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.

40. 【2602.02178】AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?

链接https://arxiv.org/abs/2602.02178

作者:Liang Lin,Feng Xiong,Zengbin Wang,Kun Wang,Junhao Dong,Xuecai Hu,Yong Wang,Xiangxiang Chu

类目:Computation and Language (cs.CL)

关键词:Diffusion Large Language, Large Language Models, Diffusion Large, Large Language, enabling parallel token

备注

点击查看摘要

Abstract:Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment and comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving 69.08\% average score across all tasks and models. Our Code is available at this https URL.

41. 【2602.02160】D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use

链接https://arxiv.org/abs/2602.02160

作者:Bowen Xu,Shaoyu Wu,Hao Jiang,Kai Liu,Xin Chen,Lulu Hu,Bin Yang

类目:Computation and Language (cs.CL)

关键词:complex real-world problems, Effective tool, real-world problems, address complex real-world, essential capabilities

备注

点击查看摘要

Abstract:Effective tool use and reasoning are essential capabilities for large reasoning models~(LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework D-CORE~(\underline{\textbf{D}}ecomposing tasks and \underline{\textbf{Co}}mposing \underline{\textbf{Re}}asoning processes) that first incentivize the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning~(RL) to restore LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7\% accuracy, surpassing the best-performing 8B model by 5.7\%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3\%, outperforming 70B models despite being 5$\times$ smaller. The source code is available at this https URL.

42. 【2602.02159】Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing

链接https://arxiv.org/abs/2602.02159

作者:Lingkun Long,Yushi Huang,Shihao Bai,Ruihao Gong,Jun Zhang,Ao Zhou,Jianlei Yang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Diffusion Large Language, Language Models, Large Language, non-autoregressive decoding paradigm

备注

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than $29\times$ lossless speedup under $32K$ context length. The code is publicly available at: this https URL

43. 【2602.02151】Revisiting Adaptive Rounding with Vectorized Reparameterization for LLM Quantization

链接https://arxiv.org/abs/2602.02151

作者:Yuli Zhou,Qingxuan Chen,Luca Benini,Guolei Sun,Yawei Li

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:cross-element error cancellation, enabling cross-element error, post-training quantization, quantization by enabling, enabling cross-element

备注: 17 pages, 6 figures, 14 tables

点击查看摘要

Abstract:Adaptive Rounding has emerged as an alternative to round-to-nearest (RTN) for post-training quantization by enabling cross-element error cancellation. Yet, dense and element-wise rounding matrices are prohibitively expensive for billion-parameter large language models (LLMs). We revisit adaptive rounding from an efficiency perspective and propose VQRound, a parameter-efficient optimization framework that reparameterizes the rounding matrix into a compact codebook. Unlike low-rank alternatives, VQRound minimizes the element-wise worst-case error under $L_\infty$ norm, which is critical for handling heavy-tailed weight distributions in LLMs. Beyond reparameterization, we identify rounding initialization as a decisive factor and develop a lightweight end-to-end finetuning pipeline that optimizes codebooks across all layers using only 128 samples. Extensive experiments on OPT, LLaMA, LLaMA2, and Qwen3 models demonstrate that VQRound achieves better convergence than traditional adaptive rounding at the same number of steps while using as little as 0.2% of the trainable parameters. Our results show that adaptive rounding can be made both scalable and fast-fitting. The code is available at this https URL.

44. 【2602.02143】Learning Generative Selection for Best-of-N

链接https://arxiv.org/abs/2602.02143

作者:Shubham Toshniwal,Aleksander Ficek,Siddhartha Jain,Wei Du,Vahid Noroozi,Sadegh Mahdavi,Somshubra Majumdar,Igor Gitman

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:substantially improve LLM, improve LLM reasoning, improve LLM, LLM reasoning, compute via parallel

备注

点击查看摘要

Abstract:Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.

45. 【2602.02140】Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

链接https://arxiv.org/abs/2602.02140

作者:Chenlong Wang,Yuhang Chen,Zhihan Hu,Dongping Chen,Wenhu Chen,Sarah Wiegreffe,Tianyi Zhou

类目:Computation and Language (cs.CL)

关键词:demonstrated remarkable progress, Recent advances, unified multimodal models, demonstrated remarkable, remarkable progress

备注

点击查看摘要

Abstract:Recent advances in unified multimodal models (UMM) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities, and quantitatively measure the cognitive coherence of the two "unified" directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two. To further explore the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate the underlying limitations. Our findings indicate that knowledge within UMMs often remains disjoint. The capability emergence and knowledge across modalities are unsynchronized, paving the way for further exploration.

46. 【2602.02139】EvoMU: Evolutionary Machine Unlearning

链接https://arxiv.org/abs/2602.02139

作者:Pawel Batorski,Paul Swoboda

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Machine unlearning aims, Machine unlearning, sensitive or copyrighted, copyrighted material, aims to unlearn

备注

点击查看摘要

Abstract:Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine-tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over-unlearn or under-unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task-specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset-specific losses that match or outperform existing losses from the literature, without the need for a human-in-the-loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co-scientist. In contrast to previous AI co-scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3-4B-Thinking), showing the potential of AI co-scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss-based unlearning formulations on TOFU-5%, TOFU-10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at this https URL.

47. 【2602.02133】Understanding the Reversal Curse Mitigation in Masked Diffusion Models through Attention and Training Dynamics

链接https://arxiv.org/abs/2602.02133

作者:Sangwoo Shin,BumJun Kim,Kyelim Lee,Moongyu Jeon,Albert No

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Autoregressive language models, Autoregressive language, language models, reversal curse, diffusion-based language models

备注

点击查看摘要

Abstract:Autoregressive language models (ARMs) suffer from the reversal curse: after learning that "$A$ is $B$", they often fail on the reverse query "$B$ is $A$". Masked diffusion-based language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to the any-order training objective. However, observing "[MASK] is $B$" during training does not necessarily teach the model to handle the reverse prompt "$B$ is [MASK]". We show that the mitigation arises from architectural structure and its interaction with training. In a one-layer Transformer encoder, weight sharing couples the two directions by making forward and reverse attention scores positively correlated. In the same setting, we further show that the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss. Experiments on both controlled toy tasks and large-scale diffusion language models support these mechanisms, explaining why MDMs partially overcome a failure mode that persists in strong ARMs.

48. 【2602.02132】here Is More to Refusal in Large Language Models than a Single Direction

链接https://arxiv.org/abs/2602.02132

作者:Faaiz Joad,Majd Hawasly,Sabri Boughorbel,Nadir Durrani,Husrev Taha Sencar

类目:Computation and Language (cs.CL)

关键词:Prior work argues, Prior work, enabling effective steering, large language models, enabling effective

备注

点击查看摘要

Abstract:Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.

49. 【2602.02112】Unifying Masked Diffusion Models with Various Generation Orders and Beyond

链接https://arxiv.org/abs/2602.02112

作者:Chunsan Hong,Sanghyun Lee,Jong Chul Ye

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:quality depends critically, generation quality depends, masked diffusion model, potential alternative, alternative to autoregressive

备注: Preprint

点击查看摘要

Abstract:Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

50. 【2602.02108】Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

链接https://arxiv.org/abs/2602.02108

作者:Wenhao Li,Daohai Yu,Gen Luo,Yuxin Zhang,Fei Chao,Rongrong Ji,Yifan Wu,Jiaxin Liu,Ziyang Gong,Zimu Liao

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Training Large Language, prohibitive GPU memory, Large Language

备注

点击查看摘要

Abstract:Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at this https URL.

51. 【2602.02104】Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs

链接https://arxiv.org/abs/2602.02104

作者:Shaltiel Shmidman,Avi Shmidman,Amir DN Cohen,Moshe Koppel

类目:Computation and Language (cs.CL)

关键词:sovereign Large Language, Large Language Models, Large Language, Training large language, sovereign Large

备注

点击查看摘要

Abstract:Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantially-sized corpora of Hebrew and English texts. The model is released in three sizes: 24B - adapted from the Mistral-Small-3.1 base model, 12B - adapted from the NVIDIA Nemotron Nano V2 model, and 1.7B - adapted from the Qwen3-1.7B base model. We are releasing multiple variants of each model, each with a native context length of 65k tokens; base model and chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluation of Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.

52. 【2602.02103】No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs

链接https://arxiv.org/abs/2602.02103

作者:Liyan Xu,Mo Yu,Fandong Meng,Jie Zhou

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, prior complementary observations, subsequent reasoning prior, requiring multi-step reasoning

备注

点击查看摘要

Abstract:This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) is shown latent planning of subsequent reasoning prior to CoT emergence, thereby diminishing the significance of explicit CoT; whereas CoT remains critical for tasks requiring multi-step reasoning. To deepen the understanding between LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele-Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at this https URL.

53. 【2602.02099】hink Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning

链接https://arxiv.org/abs/2602.02099

作者:Keqin Peng,Yuanxin Ouyang,Xuebo Liu,Zhiliang Tian,Ruijian Han,Yancheng Yuan,Liang Ding

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Reinforcement Learning, Learning with Verifiable, overly verbose traces, elicit strong multi-step, encourages overly verbose

备注

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency--accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) versus over 20% on harder benchmarks (e.g., AIME25), thereby maintaining or improving accuracy. Code is available at this https URL.

54. 【2602.02090】LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction -- A Case Study on SDGs

链接https://arxiv.org/abs/2602.02090

作者:Yikai Zeng,Yingchao Piao,Jianhui Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Constructing domain-specific knowledge, heterogeneous entity mentions, remains challenging due, Large Language Models, Constructing domain-specific

备注

点击查看摘要

Abstract:Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively-KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.

55. 【2602.02084】Closing the Loop: Universal Repository Representation with RPG-Encoder

链接https://arxiv.org/abs/2602.02084

作者:Jane Luo,Chengyu Yin,Xin Zhang,Qingtao Li,Steven Liu,Yiming Huang,Jie Wu,Hao Liu,Yangyu Huang,Yu Kang,Fangkai Yang,Ying Xin,Scarlett Li

类目:Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:isolated API documentation, Current repository agents, existing methods rely, isolated API, API documentation

备注

点击查看摘要

Abstract:Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live Lite. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG's high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.

56. 【2602.02053】WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora

链接https://arxiv.org/abs/2602.02053

作者:Pengyu Wang,Benfeng Xu,Licheng Zhang,Shaohan Wang,Mingxuan Du,Chiwei Zhu,Zhendong Mao

类目:Computation and Language (cs.CL)

关键词:Graph-based Retrieval-Augmented Generation, Graph-based Retrieval-Augmented, Retrieval-Augmented Generation, enabling efficient retrieval, organizes external knowledge

备注: [this https URL](https://github.com/BstWPY/WildGraphBench)

点击查看摘要

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-word scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page:this https URL.

57. 【2602.02047】Dissecting Outlier Dynamics in LLM NVFP4 Pretraining

链接https://arxiv.org/abs/2602.02047

作者:Peijie Dong,Ruibo Fan,Yuechen Tao,Di Mou,Wenhu Hu,Zhenheng Tang,Yinghao Yu,Jiamang Wang,Wenbo Su,Guodong Yang,Liping Zhang,Xiaowen Chu,Baochun Li,Bo Li

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:arithmetic enhances throughput, arithmetic enhances, memory efficiency, enhances throughput, throughput and memory

备注: 39 pages, 32 figures

点击查看摘要

Abstract:Training large language models using 4-bit arithmetic enhances throughput and memory efficiency. Yet, the limited dynamic range of FP4 increases sensitivity to outliers. While NVFP4 mitigates quantization error via hierarchical microscaling, a persistent loss gap remains compared to BF16. This study conducts a longitudinal analysis of outlier dynamics across architecture during NVFP4 pretraining, focusing on where they localize, why they occur, and how they evolve temporally. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. Our analysis attributes outliers to specific architectural components: Softmax in SA, gating in LA, and SwiGLU in FFN, with "post-QK" operations exhibiting higher sensitivity to quantization. Notably, outliers evolve from transient spikes early in training to a small set of persistent hot channels (i.e., channels with persistently large magnitudes) in later stages. Based on these findings, we introduce Hot-Channel Patch (HCP), an online compensation mechanism that identifies hot channels and reinjects residuals using hardware-efficient kernels. We then develop CHON, an NVFP4 training recipe integrating HCP with post-QK operation protection. On GLA-1.3B model trained for 60B tokens, CHON reduces the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy.

58. 【2602.02039】Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

链接https://arxiv.org/abs/2602.02039

作者:Wei Liu,Peijie Yu,Michele Orini,Yali Du,Yulan He

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

关键词:Agentic Large Language, Large Language Models, Large Language, Agentic Large, answering correctly

备注: 14 pages, 7 tables, 8 figures

点击查看摘要

Abstract:The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

59. 【2602.02014】Rethinking Genomic Modeling Through Optical Character Recognition

链接https://arxiv.org/abs/2602.02014

作者:Hongxin Xiang,Pengsen Ma,Yunkang Cao,Di Yu,Haowen Chen,Xinyu Yang,Xiangxiang Zeng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:largely adopt large, adopt large language, foundation models largely, models largely adopt, Optical Character Recognition

备注

点击查看摘要

Abstract:Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision--language model with a \emph{visual DNA encoder} and a \emph{document decoder}, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives-reading, region grounding, subsequence retrieval, and masked span completion-thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly $20\times$ fewer effective tokens, and surpasses models with up to $985\times$ more activated parameters while tuning only 256k \emph{trainable} parameters.

60. 【2602.02010】NEAT: Neuron-Based Early Exit for Large Reasoning Models

链接https://arxiv.org/abs/2602.02010

作者:Kang Liu,Yongkang Liu,Xiaocui Yang,Peidong Wang,Wen Zhang,Shi Feng,Yifei Zhang,Daling Wang

类目:Computation and Language (cs.CL)

关键词:Large Reasoning Models, redundant reasoning steps, Large Reasoning, textbf, Reasoning

备注

点击查看摘要

Abstract:Large Reasoning Models (LRMs) often suffer from \emph{overthinking}, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose \textbf{NEAT}, a \textbf{N}euron-based \textbf{E}arly re\textbf{A}soning exi\textbf{T} framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that, for each model, NEAT achieves an average token reduction of 22\% to 28\% when averaged over the four benchmarks, while maintaining accuracy.

61. 【2602.02007】Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

链接https://arxiv.org/abs/2602.02007

作者:Zhanghao Hu,Qinglin Zhu,Hanqi Yan,Yulan He,Lin Gui

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:standard Retrieval-Augmented Generation, underlying assumptions differ, Agent memory systems, Retrieval-Augmented Generation, RAG targets large

备注

点击查看摘要

Abstract:Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-$k$ similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity--semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.

62. 【2602.01999】From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs

链接https://arxiv.org/abs/2602.01999

作者:Yanrui Du,Yibo Gao,Sendong Zhao,Jiayun Li,Haochun Wang,Qika Lin,Kai He,Bing Qin,Mengling Feng

类目:Computation and Language (cs.CL)

关键词:attracted growing attention, internal mechanisms underlying, behavior remain unclear, LLMs have attracted, remain unclear

备注

点击查看摘要

Abstract:R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process-progressing from latent monitoring, to discourse-level regulation, and to finally overt self-reflection. Our analysis code can be found at this https URL.

63. 【2602.01982】S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs

链接https://arxiv.org/abs/2602.01982

作者:Yanrui Du,Sendong Zhao,Yibo Gao,Danyang Zhao,Qika Lin,Ming Ma,Jiayun Li,Yi Jiang,Kai He,Qianyi Xu,Bing Qin,Mengling Feng

类目:Computation and Language (cs.CL)

关键词:achieve strong performance, Large language models, Large language, achieve strong, strong performance

备注

点击查看摘要

Abstract:Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods-the scarcity of high-quality supervision data. Using filtered data by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent data of variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at this https URL.

64. 【2602.01977】Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing

链接https://arxiv.org/abs/2602.01977

作者:Shuainan Liu,Xuanang Chen,Ben He,Le Sun

类目:Computation and Language (cs.CL)

关键词:assess edited facts, large language models, large language, commonly evaluated, evaluated using predefined

备注

点击查看摘要

Abstract:Knowledge editing methods for large language models are commonly evaluated using predefined benchmarks that assess edited facts together with a limited set of related or neighboring knowledge. While effective, such evaluations remain confined to finite, dataset-bounded samples, leaving the broader impact of editing on the model's knowledge system insufficiently understood. To address this gap, we introduce Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space, enabling the exploration of a substantially broader and virtualized knowledge region beyond explicit data annotations. Based on EVK, we construct an embedding-level evaluation benchmark EVK-Bench that quantifies potential knowledge drift induced by editing, revealing effects that are not captured by conventional sample-based metrics. Furthermore, we propose a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing and can be seamlessly integrated into existing editing methods. Experiments demonstrate that our approach enables more comprehensive evaluation while significantly improving knowledge preservation without sacrificing editing accuracy.

65. 【2602.01969】Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models

链接https://arxiv.org/abs/2602.01969

作者:Bin Cao,Huixian Lu,Chenwen Ma,Ting Wang,Ruizhe Li,Jing Fan

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:heterogeneous layouts pose, layouts pose persistent, pose persistent challenges, understanding and reasoning, heterogeneous layouts

备注: Work in process

点击查看摘要

Abstract:Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial--semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.

66. 【2602.01967】Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition

链接https://arxiv.org/abs/2602.01967

作者:Wonjun Lee,Hyounghun Kim,Gary Geunbae Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:automatic speech recognition, substantial performance degradation, Accented speech remains, high-resource English varieties, English varieties

备注

点击查看摘要

Abstract:Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.

67. 【2602.01965】Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation

链接https://arxiv.org/abs/2602.01965

作者:Kwun Hang Lau,Fangyuan Zhang,Boyu Ruan,Yingli Zhou,Qintian Guo,Ruiyuan Zhang,Xiaofang Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:leverage Knowledge Graphs, simple vector similarity, Personalized PageRank, Recent advances, Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:Recent advances in Retrieval-Augmented Generation (RAG) have shifted from simple vector similarity to structure-aware approaches like HippoRAG, which leverage Knowledge Graphs (KGs) and Personalized PageRank (PPR) to capture multi-hop dependencies. However, these methods suffer from a "Static Graph Fallacy": they rely on fixed transition probabilities determined during indexing. This rigidity ignores the query-dependent nature of edge relevance, causing semantic drift where random walks are diverted into high-degree "hub" nodes before reaching critical downstream evidence. Consequently, models often achieve high partial recall but fail to retrieve the complete evidence chain required for multi-hop queries. To address this, we propose CatRAG, Context-Aware Traversal for robust RAG, a framework that builds on the HippoRAG 2 architecture and transforms the static KG into a query-adaptive navigation structure. We introduce a multi-faceted framework to steer the random walk: (1) Symbolic Anchoring, which injects weak entity constraints to regularize the random walk; (2) Query-Aware Dynamic Edge Weighting, which dynamically modulates graph structure, to prune irrelevant paths while amplifying those aligned with the query's intent; and (3) Key-Fact Passage Weight Enhancement, a cost-efficient bias that structurally anchors the random walk to likely evidence. Experiments across four multi-hop benchmarks demonstrate that CatRAG consistently outperforms state of the art baselines. Our analysis reveals that while standard Recall metrics show modest gains, CatRAG achieves substantial improvements in reasoning completeness, the capacity to recover the entire evidence path without gaps. These results reveal that our approach effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning. Resources are available at this https URL.

68. 【2602.01919】From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted "Vibe Coding"

链接https://arxiv.org/abs/2602.01919

作者:Hend Al-Khalifa

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language

备注: Accepted in The Seventh Workshop on Teaching Natural Language Processing (Teaching NLP @ EACL2026)

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces ``Vibe Coding,'' a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.

69. 【2602.01917】GuideWeb: A Benchmark for Automatic In-App Guide Generation on Real-World Web UIs

链接https://arxiv.org/abs/2602.01917

作者:Chengguang Gan,Yoshihiro Tsujii,Yunhao Liang,Tatsunori Mori,Shiwen Ni,Hiroki Itoh

类目:Computation and Language (cs.CL)

关键词:Digital Adoption Platform, Digital Adoption, Adoption Platform, provide web-based overlays, navigate complex websites

备注

点击查看摘要

Abstract:Digital Adoption Platform (DAP) provide web-based overlays that deliver operation guidance and contextual hints to help users navigate complex websites. Although modern DAP tools enable non-experts to author such guidance, maintaining these guides remains labor-intensive because website layouts and functionalities evolve continuously, which requires repeated manual updates and re-annotation. In this work, we introduce \textbf{GuideWeb}, a new benchmark for automatic in-app guide generation on real-world web UIs. GuideWeb formulates the task as producing page-level guidance by selecting \textbf{guide target elements} grounded in the webpage and generating concise guide text aligned with user intent. We also propose a comprehensive evaluation suite that jointly measures the accuracy of guide target element selection and the quality of generated intents and guide texts. Experiments show that our proposed \textbf{GuideWeb Agent} achieves \textbf{30.79\%} accuracy in guide target element prediction, while obtaining BLEU scores of \textbf{44.94} for intent generation and \textbf{21.34} for guide-text generation. Existing baselines perform substantially worse, which highlights that automatic guide generation remains challenging and that further advances are necessary before such systems can be reliably deployed in real-world settings.

70. 【2602.01885】ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

链接https://arxiv.org/abs/2602.01885

作者:Tiantian Chen,Jiaqi Lu,Ying Shen,Lin Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Language Models, shown strong potential, shown strong

备注: 12 pages, 7 figures. Accepted to The Web Conference (WWW) 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. However, existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling, in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.

71. 【2602.01875】PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

链接https://arxiv.org/abs/2602.01875

作者:Langming Liu,Kangtao Lv,Haibin Chen,Weidong Zhang,Yejing Wang,Shilei Liu,Xin Tong,Yujin Yuan,Yongwei Wang,Wenbo Su,Bo Zheng

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, generate verifiable falsehoods, powerful capabilities, generate verifiable

备注

点击查看摘要

Abstract:Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of "low-probability truth" and "high-probability falsehood". Recent approaches, such as teaching models to say "I don't know" or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose \textbf{PretrainRL}, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is "\textbf{debiasing then learning}." It actively reshapes the model's probability distribution by down-weighting high-probability falsehoods, thereby making "room" for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model's probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.

72. 【2602.01840】Read As Human: Compressing Context via Parallelizable Close Reading and Skimming

链接https://arxiv.org/abs/2602.01840

作者:Jiwei Tang,Shilei Liu,Zhicheng Zhang,Qingsong Lv,Runsong Zhao,Tingwei Lu,Langming Liu,Haibin Chen,Yujin Yuan,Hai-Tao Zheng,Wenbo Su,Bo Zheng

类目:Computation and Language (cs.CL)

关键词:Large Language Models, demonstrate exceptional capability, Large Language, Language Models, diverse tasks

备注: 13 pages,5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy, to address these challenges. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are query-guided compressed into compact summary vectors (skimming). Both explicit textual segments and implicit summary vectors are concatenated and fed into decoder to achieve both superior performance and natural language format interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).

73. 【2602.01838】AXE: Low-Cost Cross-Domain Web Structured Information Extraction

链接https://arxiv.org/abs/2602.01838

作者:Abdelrahman Mansour,Khaled W. Alshaer,Moataz Elsaban

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Extracting structured data, Large Language, cost of Large, Adaptive X-Path Extractor

备注

点击查看摘要

Abstract:Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.

74. 【2602.01807】Sentence Curve Language Models

链接https://arxiv.org/abs/2602.01807

作者:DongNyeong Heo,Heelyoul Choi

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:diffusion-based language models, modern AI systems, competitive alternative, sentence, central component

备注

点击查看摘要

Abstract:Language models (LMs) are a central component of modern AI systems, and diffusion-based language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while neglecting global structure across the target sentence. To address this limitation, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.

75. 【2602.01785】CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

链接https://arxiv.org/abs/2602.01785

作者:Yuling Shi,Chaoxiang Xie,Zhensu Sun,Yeheng Chen,Chenxu Zhang,Longfei Yun,Chengcheng Wan,Hongyu Zhang,David Lo,Xiaodong Gu

类目:Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:Large Language Models, Large Language, achieved remarkable success, software systems grow, Language Models

备注: Code and data are available at [this https URL](https://github.com/YerbaPage/CodeOCR)

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.

76. 【2602.01778】Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model

链接https://arxiv.org/abs/2602.01778

作者:Kangtao Lv,Jiwei Tang,Langming Liu,Haibin Chen,Weidong Zhang,Shilei Liu,Yongwei Wang,Yujin Yuan,Wenbo Su,Bo Zheng

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, deployment of Large, significant information redundancy, Language Models

备注: 15 pages,6 figures

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research only focus on model-side improvements, the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework to systematically investigate it. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between intrinsic data of the encoder and decoder significantly diminishes compression gains, which is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.

77. 【2602.01771】SOG_k: One LLM Token for Explicit Graph Structural Understanding

链接https://arxiv.org/abs/2602.01771

作者:Jingyao Wu,Bin Lu,Zijun Di,Xiaoying Gan,Meng Jin,Luoyi Fu,Xinbing Wang,Chenghu Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

关键词:Large language models, models show great, show great potential, face significant challenges, language models show

备注

点击查看摘要

Abstract:Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompt), but exhibit severe misalignment with original text tokens. To solve this problem, we propose to incorporate one special token SOG_k to fully represent the Structure Of Graph within a unified token space, facilitating explicit topology input and structural information sharing. Specifically, we propose a topology-aware structural tokenizer that maps each graph topology into a highly selective single token. Afterwards, we construct a set of hybrid structure Question-Answering corpora to align new structural tokens with existing text tokens. With this approach, SOG_k empowers LLMs to understand, generate, and reason in a concise and accurate manner. Extensive experiments on five graph-level benchmarks demonstrate the superiority of our method, achieving a performance improvement of 9.9% to 41.4% compared to the baselines while exhibiting interpretability and consistency. Furthermore, our method provides a flexible extension to node-level tasks, enabling both global and local structural understanding. The codebase is publicly available at this https URL.

78. 【2602.01762】PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models

链接https://arxiv.org/abs/2602.01762

作者:Xuliang Wang,Yuetao Chen,Maochan Zhen,Fang Liu,Xinzhou Zheng,Xingwu Liu,Hong Xu,Ming Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, auto-regressive nature, suffer from slow, Large

备注

点击查看摘要

Abstract:Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.01762 [cs.AI]

(or
arXiv:2602.01762v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2602.01762

Focus to learn more

              arXiv-issued DOI via DataCite</p>
79. 【2602.01757】Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings

链接https://arxiv.org/abs/2602.01757

作者:Doohyun Kim,Donghwa Kang,Kyungjae Lee,Hyeongboo Baek,Brent Byunghoon Kang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:established vector databases, embedding inversion attacks, severe privacy risks, critical infrastructure, inversion attacks

备注: 10 pages

点击查看摘要

Abstract:The proliferation of retrieval-augmented generation (RAG) has established vector databases as critical infrastructure, yet they introduce severe privacy risks via embedding inversion attacks. Existing paradigms face a fundamental trade-off: optimization-based methods require computationally prohibitive queries, while alignment-based approaches hinge on the unrealistic assumption of accessible in-domain training data. These constraints render them ineffective in strict black-box and cross-domain settings. To dismantle these barriers, we introduce Zero2Text, a novel training-free framework based on recursive online alignment. Unlike methods relying on static datasets, Zero2Text synergizes LLM priors with a dynamic ridge regression mechanism to iteratively align generation to the target embedding on-the-fly. We further demonstrate that standard defenses, such as differential privacy, fail to effectively mitigate this adaptive threat. Extensive experiments across diverse benchmarks validate Zero2Text; notably, on MS MARCO against the OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines, recovering sentences from unknown domains without a single leaked data pair.

80. 【2602.01752】WorldCup Sampling for Multi-bit LLM Watermarking

链接https://arxiv.org/abs/2602.01752

作者:Yidan Wang,Yubing Ren,Yanan Cao,Li Guo

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:large language models, generate increasingly human-like, increasingly human-like text, language models, generate increasingly

备注

点击查看摘要

Abstract:As large language models (LLMs) generate increasingly human-like text, watermarking offers a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing methods largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup further adopts entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding. Comprehensive experiments show that WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines and laying a solid foundation for future LLM watermarking studies.

81. 【2602.01747】Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training

链接https://arxiv.org/abs/2602.01747

作者:Hongseok Choi,Serynn Kim,Wencke Liermann,Jin Seong,Jin-Xia Huang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Automated Essay Scoring, efficient assessment tools, Automated Essay, Essay Scoring, plays a crucial

备注: 22 pages, 4 figures

点击查看摘要

Abstract:Automated Essay Scoring (AES) plays a crucial role in education by providing scalable and efficient assessment tools. However, in real-world settings, the extreme scarcity of labeled data severely limits the development and practical adoption of robust AES systems. This study proposes a novel approach to enhance AES performance in both limited-data and full-data settings by introducing three key techniques. First, we introduce a Two-Stage fine-tuning strategy that leverages low-rank adaptations to better adapt an AES model to target prompt essays. Second, we introduce a Score Alignment technique to improve consistency between predicted and true score distributions. Third, we employ uncertainty-aware self-training using unlabeled data, effectively expanding the training set with pseudo-labeled samples while mitigating label noise propagation. We implement above three key techniques on DualBERT. We conduct extensive experiments on the ASAP++ dataset. As a result, in the 32-data setting, all three key techniques improve performance, and their integration achieves 91.2% of the full-data performance trained on approximately 1,000 labeled samples. In addition, the proposed Score Alignment technique consistently improves performance in both limited-data and full-data settings: e.g., it achieves state-of-the-art results in the full-data setting when integrated into DualBERT.

82. 【2602.01725】SafePred: A Predictive Guardrail for Computer-Using Agents via World Models

链接https://arxiv.org/abs/2602.01725

作者:Yurun Chen,Zeyi Liao,Ping Yin,Taotao Xie,Keting Yin,Shengyu Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:complex real-world environments, deployment of Computer-using, Computer-using Agents, current observation space, prevalent long-term risks

备注

点击查看摘要

Abstract:With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, prevalent long-term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), they cannot proactively avoid long-term risks: seemingly reasonable actions can lead to high-risk consequences that emerge with a delay (e.g., cleaning logs leads to future audits being untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, with the core idea of aligning predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: by using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidances through step-level interventions and task-level re-planning. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.

83. 【2602.01719】COMI: Coarse-to-fine Context Compression via Marginal Information Gain

链接https://arxiv.org/abs/2602.01719

作者:Jiwei Tang,Shilei Liu,Zhicheng Zhang,Yujin Yuan,Libin Zheng,Wenbo Su,Bo Zheng

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, demonstrated exceptional capabilities, Large Language, diverse tasks

备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and low redundant. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA), summarization (e.g., MultiNews) with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., approximately 25-point Exact Match (EM) improvement under 32x compression constraint with Qwen2-7B on NaturalQuestions.

84. 【2602.01717】BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition

链接https://arxiv.org/abs/2602.01717

作者:Hyunsik Kim,Haeri Kim,Munhak Lee,Kyungmin Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:automatic speech recognition, Multilingual automatic speech, speech recognition, writing systems, automatic speech

备注: accepted to ICASSP 2026

点击查看摘要

Abstract:Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.

85. 【2602.01716】Mechanistic Indicators of Steering Effectiveness in Large Language Models

链接https://arxiv.org/abs/2602.01716

作者:Mehdi Jafari,Hao Xue,Flora Salim

类目:Computation and Language (cs.CL)

关键词:enables Large Language, Large Language Models, Large Language, Activation-based steering enables, steering enables Large

备注

点击查看摘要

Abstract:Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.

86. 【2602.01714】MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

链接https://arxiv.org/abs/2602.01714

作者:Mouath Abu-Daoud,Leen Kharouf,Omar El Hajj,Dana El Samad,Mariam Al-Omari,Jihad Mallat,Khaled Saleh,Nizar Habash,Farah E. Shamout

类目:Computation and Language (cs.CL)

关键词:natural language processing, language processing research, Large Language Models, underrepresented languages, natural language

备注

点击查看摘要

Abstract:Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.

87. 【2602.01709】ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

链接https://arxiv.org/abs/2602.01709

作者:Xingshan Zeng,Lingzhi Wang,Weiwen Liu,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu

类目:Computation and Language (cs.CL)

关键词:Current test-time scaling, techniques enhance large, enhance large language, allocating additional computation, Current test-time

备注

点击查看摘要

Abstract:Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.

88. 【2602.01708】Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory

链接https://arxiv.org/abs/2602.01708

作者:Langyuan Cui,Chun Kai Ling,Hwee Tou Ng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

关键词:Large Language Models, lack sufficient information, Large Language, Language Models, increasingly deployed

备注: 23 pages, 10 figures, under review at ICML 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade \textit{worst-case} performance. This is an issue with serious implications in high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem along with its variants as a two-player zero-sum extensive form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.

89. 【2602.01703】$\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality

链接https://arxiv.org/abs/2602.01703

作者:Pengyu Li,Lingling Zhang,Zhitao Gao,Yanrui Wu,Yuxuan Dong,Huan Liu,Bifan Wei,Jun Liu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, posing critical privacy, achieved remarkable capabilities, memorize sensitive data

备注

点击查看摘要

Abstract:While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at this https URL.

90. 【2602.01698】Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

链接https://arxiv.org/abs/2602.01698

作者:Wenhui Tan,Fiorenzo Parascandolo,Enver Sangineto,Jianzhong Ju,Zhenbo Luo,Qian Cao,Rita Cucchiara,Ruihua Song,Jian Luan

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Reinforcement Learning, Large Reasoning Models, recently achieved strong, achieved strong mathematical, performance through Reinforcement

备注

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: this https URL.

91. 【2602.01687】Counting Hypothesis: Potential Mechanism of In-Context Learning

链接https://arxiv.org/abs/2602.01687

作者:Jung H. Lee,Sujith Vijayan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, In-Context Learning, learn specific tasks, language models, ICL

备注: 19 pages, 7 main Figures, 1 Table and 6 Supp. Figures

点击查看摘要

Abstract:In-Context Learning (ICL) indicates that large language models (LLMs) pretrained on a massive amount of data can learn specific tasks from input prompts' examples. ICL is notable for two reasons. First, it does not need modification of LLMs' internal structure. Second, it enables LLMs to perform a wide range of tasks/functions with a few examples demonstrating a desirable task. ICL opens up new ways to utilize LLMs in more domains, but its underlying mechanisms still remain poorly understood, making error correction and diagnosis extremely challenging. Thus, it is imperative that we better understand the limitations of ICL and how exactly LLMs support ICL. Inspired by ICL properties and LLMs' functional modules, we propose 1the counting hypothesis' of ICL, which suggests that LLMs' encoding strategy may underlie ICL, and provide supporting evidence.

92. 【2602.01672】Scaling Search-Augmented LLM Reasoning via Adaptive Information Control

链接https://arxiv.org/abs/2602.01672

作者:Siheng Xiong,Oguzhan Gungordu,Blair Johnson,James C. Kerce,Faramarz Fekri

类目:Computation and Language (cs.CL)

关键词:agents interleave multi-step, interleave multi-step reasoning, context saturation, interleave multi-step, leads to redundant

备注: Work in progress

点击查看摘要

Abstract:Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.

93. 【2602.01660】CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

链接https://arxiv.org/abs/2602.01660

作者:Zhongyuan Peng,Caijun Xu,Changyi Xiao,Shibo Hong,Eli Zhang,Stephen Huang,Yixin Cao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Reasoning Models, Large Reasoning, Difficult Question Generation, Controllable Difficult Question, Large

备注: 11 pages, 5 tables, 5 figures

点击查看摘要

Abstract:Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model's ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.

94. 【2602.01654】Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models

链接https://arxiv.org/abs/2602.01654

作者:Jiaqian Li,Yanshu Li,Kuan-Hao Huang

类目:Computation and Language (cs.CL)

关键词:large language models, practical middle ground, shifting hidden activations, control large language, offer a lightweight

备注

点击查看摘要

Abstract:Steering vectors (SVs) offer a lightweight way to control large language models (LLMs) at inference time by shifting hidden activations, providing a practical middle ground between prompting and fine-tuning. Yet SVs can be unreliable in practice. Some concepts are unsteerable, and even when steering helps on average it can backfire for a non-trivial fraction of inputs. Reliability also degrades in long-form generation and multi-attribute steering. We take a geometric view of these failures. A static SV applies the same update vector everywhere in representation space, implicitly assuming that the concept-improving direction is constant across contexts. When the locally effective direction varies with the current activation, a single global vector can become misaligned, which yields weak or reversed effects. Guided by this perspective, we propose Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This formulation supports coordinated multi-layer interventions in a shared, aligned concept space, and enables efficient long-form and multi-attribute control within a unified framework. Across multiple LLMs and steering tasks, SVF delivers stronger and more reliable control, improving the practicality of inference-time steering.

95. 【2602.01640】A2Eval: Agentic and Automated Evaluation for Embodied Brain

链接https://arxiv.org/abs/2602.01640

作者:Shuai Zhang,Jiayu Hu,Zijie Chen,Zeyuan Ding,Yi Zhang,Yingji Zhang,Ziyi Zhou,Junwei Liao,Shengjie Zhou,Yong Dai,Zhenzhong Lan,Xiaozhu Ju

类目:Computation and Language (cs.CL)

关键词:Current embodied VLM, VLM evaluation relies, exhibit severe redundancy, manually annotated benchmarks, embodied VLM evaluation

备注

点击查看摘要

Abstract:Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman's rho=0.85, and maintains high ranking fidelity (Kendall's tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.

96. 【2602.01618】SEA-Guard: Culturally Grounded Multilingual Safeguard for Southeast Asia

链接https://arxiv.org/abs/2602.01618

作者:Panuthep Tasawong,Jian Gang Ngui,Alham Fikri Aji,Trevor Cohn,Peerat Limkonchotiwat

类目:Computation and Language (cs.CL)

关键词:encompasses diverse local, Culturally aware safeguards, real-world settings, Culturally aware, alignment in real-world

备注: Under reivew

点击查看摘要

Abstract:Culturally aware safeguards are crucial for AI alignment in real-world settings, where safety extends beyond common sense and encompasses diverse local values, norms, and region-specific regulations. However, building large-scale, culturally grounded datasets is challenging due to limited resources and a scarcity of native annotators. Consequently, many safeguard models rely on machine translation of English datasets, often missing regional and cultural nuances. We present a novel agentic data-generation framework to scalably create authentic, region-specific safety datasets for Southeast Asia (SEA). On this foundation, we introduce the SEA-Guard family, the first multilingual safeguard models grounded in SEA cultural contexts. Evaluated across multiple benchmarks and cultural variants, SEA-Guard consistently outperforms existing safeguards at detecting regionally sensitive or harmful content while maintaining strong general safety performance.

97. 【2602.01601】Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

链接https://arxiv.org/abs/2602.01601

作者:Hieu Trung Nguyen,Bao Nguyen,Wenao Ma,Yuzhi Zhao,Ruifeng She,Viet Anh Nguyen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:verifiable rewards, key bottleneck, bottleneck in reinforcement, reinforcement learning, learning with verifiable

备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce \Ours, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, \Ours~uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that \Ours~consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks. Our code will be available at this https URL.

98. 【2602.01600】Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

链接https://arxiv.org/abs/2602.01600

作者:Yen-Shan Chen,Zhi Rui Tam,Cheng-Kuang Wu,Yun-Nung Chen

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:LLM safety predominantly, safety predominantly rely, Current evaluations, evaluations of LLM, LLM safety

备注

点击查看摘要

Abstract:Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform risk across all malicious queries, neglecting Execution Likelihood--the conditional probability of a threat being realized given the model's response. In this work, we introduce Expected Harm, a metric that weights the severity of a jailbreak by its execution likelihood, modeled as a function of execution cost. Through empirical analysis of state-of-the-art models, we reveal a systematic Inverse Risk Calibration: models disproportionately exhibit stronger refusal behaviors for low-likelihood (high-cost) threats while remaining vulnerable to high-likelihood (low-cost) queries. We demonstrate that this miscalibration creates a structural vulnerability: by exploiting this property, we increase the attack success rate of existing jailbreaks by up to $2\times$. Finally, we trace the root cause of this failure using linear probing, which reveals that while models encode severity in their latent space to drive refusal decisions, they possess no distinguishable internal representation of execution cost, making them "blind" to this critical dimension of risk.

99. 【2602.01598】he Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation

链接https://arxiv.org/abs/2602.01598

作者:Mingwen Zhang,Minqiang Yang,Changsheng Ma,Yang Yu,Hui Bai,Chen Xu,Xiangzhen Kong,Bin Hu

类目:Computation and Language (cs.CL)

关键词:deliberately initiate structured, therapists deliberately initiate, cognitive behavioral therapy, cognition-guiding inquiries, Socratic Inquiry Framework

备注

点击查看摘要

Abstract:Proactive questioning, where therapists deliberately initiate structured, cognition-guiding inquiries, is a cornerstone of cognitive behavioral therapy (CBT). Yet, current psychological large language models (LLMs) remain overwhelmingly reactive, defaulting to empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change. To bridge this gap, we propose the \textbf{Socratic Inquiry Framework (SIF)}, a lightweight, plug-and-play therapeutic intent planner that transforms LLMs from passive listeners into active cognitive guides. SIF decouples \textbf{when to ask} (via Strategy Anchoring) from \textbf{what to ask} (via Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. Complementing SIF, we introduce \textbf{Socratic-QA}, a high-quality dataset of strategy-aligned Socratic sequences that provides explicit supervision for proactive reasoning. Experiments show that SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, marking a clear shift from reactive comfort to proactive exploration. Our work establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide.

100. 【2602.01590】Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

链接https://arxiv.org/abs/2602.01590

作者:Shaohan Wang,Benfeng Xu,Licheng Zhang,Mingxuan Du,Chiwei Zhu,Xiaorui Wang,Zhendong Mao,Yongdong Zhang

类目:Computation and Language (cs.CL)

关键词:demonstrated remarkable capabilities, autonomous information retrieval, Deep Research Agents, complex research tasks, showing great potential

备注: Preprint. Work in progress

点击查看摘要

Abstract:Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability serve as a great challenge for DRAs, with GAs representing the pinnacle of which. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at this https URL

101. 【2602.01587】Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment

链接https://arxiv.org/abs/2602.01587

作者:Zehua Cheng,Jianwei Yang,Wei Dai,Jiahao Sun

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, easily bypass empirical, bypass empirical defenses, defenses like GCG

备注: 10 pages

点击查看摘要

Abstract:Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that easily bypass empirical defenses like GCG. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous lo norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.

102. 【2602.01572】LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

链接https://arxiv.org/abs/2602.01572

作者:Yeqin Zhang,Yunfei Wang,Jiaxuan Chen,Ke Qin,Yizheng Zhao,Cam-Tu Nguyen

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Natural Language Processing, Language Processing, Natural Language, Large Language Models, leverage Large Language

备注

点击查看摘要

Abstract:Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.

103. 【2602.01566】FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

链接https://arxiv.org/abs/2602.01566

作者:Chiwei Zhu,Benfeng Xu,Mingxuan Du,Shaohan Wang,Xiaorui Wang,Zhendong Mao,Yongdong Zhang

类目:Computation and Language (cs.CL)

关键词:representative long-horizon task, large language model, Deep research, representative long-horizon, long-horizon task

备注: 19 pages, 6 figures

点击查看摘要

Abstract:Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at this https URL.

104. 【2602.01560】Argument Rarity-based Originality Assessment for AI-Assisted Writing

链接https://arxiv.org/abs/2602.01560

作者:Keito Inoshita,Michiaki Omura,Tsukasa Yamanaka,Go Maeda,Kentaro Tsuji

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, traditional quality-focused writing, effortlessly generating high-quality

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) have become capable of effortlessly generating high-quality text, traditional quality-focused writing assessment is losing its significance. If the essential goal of education is to foster critical thinking and original perspectives, assessment must also shift its paradigm from quality to originality. This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth. The framework quantifies the rarity of each component using density estimation and integrates them with a quality adjustment mechanism, thereby treating quality and originality as independent evaluation axes. Experiments using human essays and AI-generated essays revealed a strong negative correlation between quality and claim rarity, demonstrating a quality-originality trade-off where higher-quality texts tend to rely on typical claim patterns. Furthermore, while AI essays achieved comparable levels of structural complexity to human essays, their claim rarity was substantially lower than that of humans, indicating that LLMs can reproduce the form of argumentation but have limitations in the originality of content.

105. 【2602.01539】MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

链接https://arxiv.org/abs/2602.01539

作者:Xiaoyu Wen,Zhida He,Han Qi,Ziyu Wan,Zhongtian Ma,Ying Wen,Tianhang Zheng,Xingcheng Xu,Chaochao Lu,Qiaosheng Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:Large Language Models, pre-collected data distributions, Large Language, crucial for Large, Ensuring robust safety

备注

点击查看摘要

Abstract:Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their \textbf{reliance on static, pre-collected data distributions}. In this paper, we introduce \textbf{MAGIC}, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a \textbf{co-evolution}, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves \textbf{novel, previously unseen combinatorial strategies} through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at this https URL.

106. 【2602.01523】A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

链接https://arxiv.org/abs/2602.01523

作者:Akifumi Wachi,Hirota Kinoshita,Shokichi Takakura,Rei Higuchi,Taiji Suzuki

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, language models, Reinforcement learning, dominant paradigm, paradigm for improving

备注: 28 pages

点击查看摘要

Abstract:Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a \emph{relative-budget} theory explaining this variation through a single quantity called relative budget $\xi := H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $\xi$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the \emph{deficient} regime ($\xi \to 0$), informative trajectories are rare and the sample complexity explodes; in the \emph{balanced} regime ($\xi=\Theta(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the \emph{ample} regime ($\xi \to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $\xi \in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.

107. 【2602.01511】Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

链接https://arxiv.org/abs/2602.01511

作者:Ran Xu,Tianci Liu,Zihan Dong,Tony You,Ilgee Hong,Carl Yang,Linjun Zhang,Tao Zhao,Haoyu Wang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Standard reward models, reward models typically, models typically predict, typically predict scalar, predict scalar scores

备注: The first two authors contributed equally

点击查看摘要

Abstract:Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

108. 【2602.01479】Ebisu: Benchmarking Large Language Models in Japanese Finance

链接https://arxiv.org/abs/2602.01479

作者:Xueqing Peng,Ruoyu Xiang,Fan Zhang,Mingzi Song,Mingyang Jiang,Yan Wang,Lingfei Qian,Taiki Hara,Yuqing Guo,Jimin Huang,Junichi Tsujii,Sophia Ananiadou

类目:Computation and Language (cs.CL)

关键词:head-final linguistic structure, Japanese finance combines, finance combines agglutinative, high-context communication norms, mixed writing systems

备注

点击查看摘要

Abstract:Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing QA, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks. While increased model scale yields limited improvements, language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.

109. 【2602.01472】ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

链接https://arxiv.org/abs/2602.01472

作者:Jie Deng,Shining Liang,Jun Li,Hongzhi Li,Yutao Xie

类目:Computation and Language (cs.CL)

关键词:typically solve reasoning-intensive, substantial inference overhead, solve reasoning-intensive tasks, Large reasoning models, typically solve

备注

点击查看摘要

Abstract:Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.

110. 【2602.01451】Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language

链接https://arxiv.org/abs/2602.01451

作者:Umme Abira Azmary,MD Ikramul Kayes,Swakkhar Shatabda,Farig Yousuf Sadeque

类目:Computation and Language (cs.CL)

关键词:Bangla face challenges, face challenges due, limited annotated data, linguistic complexity, Bangla

备注

点击查看摘要

Abstract:Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.

111. 【2602.01447】SentiFuse: Deep Multi-model Fusion Framework for Robust Sentiment Extraction

链接https://arxiv.org/abs/2602.01447

作者:Hieu Minh Duong,Rupa Ghosh,Cong Hoan Nguyen,Eugene Levin,Todd Gary,Long Nguyen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:exhibit complementary strengths, existing approaches lack, models exhibit complementary, complementary strengths, effective integration

备注

点击查看摘要

Abstract:Sentiment analysis models exhibit complementary strengths, yet existing approaches lack a unified framework for effective integration. We present SentiFuse, a flexible and model-agnostic framework that integrates heterogeneous sentiment models through a standardization layer and multiple fusion strategies. Our approach supports decision-level fusion, feature-level fusion, and adaptive fusion, enabling systematic combination of diverse models. We conduct experiments on three large-scale social-media datasets: Crowdflower, GoEmotions, and Sentiment140. These experiments show that SentiFuse consistently outperforms individual models and naive ensembles. Feature-level fusion achieves the strongest overall effectiveness, yielding up to 4\% absolute improvement in F1 score over the best individual model and simple averaging, while adaptive fusion enhances robustness on challenging cases such as negation, mixed emotions, and complex sentiment expressions. These results demonstrate that systematically leveraging model complementarity yields more accurate and reliable sentiment analysis across diverse datasets and text types.

112. 【2602.01442】he Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks

链接https://arxiv.org/abs/2602.01442

作者:Donald Ye

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:improve generalization, unimportant low-gradient components, important high-gradient components, neural network, network can improve

备注: 8 pages, 4 figures. Submitted to the ICLR 2026 Workshop on Latent Implicit Thinking (LIT). Code: [this https URL](https://anonymous.4open.science/r/ICLR_2026_LIT-workshop_CG-D42B)

点击查看摘要

Abstract:Removing ''important'' high-gradient components from a neural network can improve generalization, while removing unimportant'' low-gradient components can destroy it. We demonstrate this paradox by formalizing the \textit{Gradient-Causal Gap} in Transformers trained on algorithmic tasks. While gradient magnitude and causal importance align on simple tasks ($\rho=0.73$ for reversal), this relationship collapses as task complexity increases ($\rho=0.32$ for sorting), sometimes becoming inverted ($\rho=-0.11$). Pruning experiments reveal that gradient magnitude is not merely inaccurate but \textit{unpredictably} so. Removing low-gradient ''Hidden Heroes'' consistently devastates OOD accuracy ($-32\%$). Removing high-gradient ''Gradient Bloats'' is a coin flip: harmless in most seeds (indicating optimization noise), catastrophic in others (indicating overfitting circuits). This unpredictability means gradient-based pruning cannot reliably preserve model capabilities.

113. 【2602.01401】From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis

链接https://arxiv.org/abs/2602.01401

作者:Niansong Zhang,Sunwoo Kim,Shreesha Srinath,Zhiru Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, raising the question, HLS, high-level synthesis, rise of large

备注

点击查看摘要

Abstract:The rise of large language models has sparked interest in AI-driven hardware design, raising the question: does high-level synthesis (HLS) still matter in the agentic era? We argue that HLS remains essential. While we expect mature agentic hardware systems to leverage both HLS and RTL, this paper focuses on HLS and its role in enabling agentic optimization. HLS offers faster iteration cycles, portability, and design permutability that make it a natural layer for agentic optimization. This position paper makes three contributions. First, we explain why HLS serves as a practical abstraction layer and a golden reference for agentic hardware design. Second, we identify key limitations of current HLS tools, namely inadequate performance feedback, rigid interfaces, and limited debuggability that agents are uniquely positioned to address. Third, we propose a taxonomy for the symbiotic evolution of agentic HLS, clarifying how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.

114. 【2602.01395】Rethinking Selective Knowledge Distillation

链接https://arxiv.org/abs/2602.01395

作者:Almog Tavor,Itay Ebenspanger,Neil Cnaan,Mor Geva

类目:Computation and Language (cs.CL)

关键词:large language models, Growing efforts, vocabulary classes, language models, large language

备注

点击查看摘要

Abstract:Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.

115. 【2602.01381】On the Power of (Approximate) Reward Models for Inference-Time Scaling

链接https://arxiv.org/abs/2602.01381

作者:Youheng Zhu,Yiping Lu

类目:Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:Sequential Monte Carlo, reward model, approximate reward, approximate reward models, approximate reward model

备注

点击查看摘要

Abstract:Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.

Subjects:

Computation and Language (cs.CL); Machine Learning (stat.ML)

Cite as:
arXiv:2602.01381 [cs.CL]

(or
arXiv:2602.01381v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.01381

Focus to learn more

              arXiv-issued DOI via DataCite</p>
116. 【2602.01378】Context Dependence and Reliability in Autoregressive Language Models

链接https://arxiv.org/abs/2602.01378

作者:Poushali Sengupta,Shashi Raj Pandey,Sabita Maharjan,Frank Eliassen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

关键词:Large language models, Large language, utilizing extensive context, retrieved passages, includes redundant information

备注

点击查看摘要

Abstract:Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.

117. 【2602.01365】When Domains Interact: Asymmetric and Order-Sensitive Cross-Domain Effects in Reinforcement Learning for Reasoning

链接https://arxiv.org/abs/2602.01365

作者:Wang Yang,Shouren Wang,Chaoda Song,Chuang Ma,Xinpeng Li,Nengbo Wang,Kaixiong Zhou,Vipin Chaudhary,Xiaotian Han

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Group Relative Policy, Relative Policy Optimization, Group Relative, Policy Optimization, Relative Policy

备注

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has become a key technique for improving reasoning abilities in large language models, yet its behavior under different domain sequencing strategies is poorly understood. In particular, the impact of sequential (one domain at a time) versus mixed-domain (multiple domain at a time) training in GRPO has not been systematically studied. We provide the first systematic analysis of training-order effects across math, science, logic, and puzzle reasoning tasks. We found (1) single-domain generalization is highly asymmetric: training on other domains improves math reasoning by approximately 25\% accuracy, while yielding negligible transfer to logic and puzzle; (2) cross-domain interactions are highly order-dependent: training in the order math$\rightarrow$science achieves 83\% / 41\% accuracy on math / science, while reversing the order to science$\rightarrow$math degrades performance to 77\% / 25\%; (3) no single strategy is universally optimal in multi-domain training: sequential training favors math (up to 84\%), mixed training favors science and logic, and poor ordering can incur large performance gaps (from 70\% to 56\%). Overall, our findings demonstrate that GRPO under multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, highlighting the necessity of domain-aware and order-aware training design.

118. 【2602.01363】Causally Disentangled Contrastive Learning for Multilingual Speaker Embeddings

链接https://arxiv.org/abs/2602.01363

作者:Mariëtte Olijslager,Seyed Sahand Mohammadi Ziabari,Ali Mohammed Mansoor Alsahag

类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:sensitive demographic attributes, encode sensitive demographic, speaker verification systems, Self-supervised speaker embeddings, speaker verification performance

备注

点击查看摘要

Abstract:Self-supervised speaker embeddings are widely used in speaker verification systems, but prior work has shown that they often encode sensitive demographic attributes, raising fairness and privacy concerns. This paper investigates the extent to which demographic information, specifically gender, age, and accent, is present in SimCLR-trained speaker embeddings and whether such leakage can be mitigated without severely degrading speaker verification performance. We study two debiasing strategies: adversarial training through gradient reversal and a causal bottleneck architecture that explicitly separates demographic and residual information. Demographic leakage is quantified using both linear and nonlinear probing classifiers, while speaker verification performance is evaluated using ROC-AUC and EER. Our results show that gender information is strongly and linearly encoded in baseline embeddings, whereas age and accent are weaker and primarily nonlinearly represented. Adversarial debiasing reduces gender leakage but has limited effect on age and accent and introduces a clear trade-off with verification accuracy. The causal bottleneck further suppresses demographic information, particularly in the residual representation, but incurs substantial performance degradation. These findings highlight fundamental limitations in mitigating demographic leakage in self-supervised speaker embeddings and clarify the trade-offs inherent in current debiasing approaches.

119. 【2602.01362】Balancing Understanding and Generation in Discrete Diffusion Models

链接https://arxiv.org/abs/2602.01362

作者:Yue Liu,Yuzhong Zhao,Zheyong Xie,Qixiang Ye,Jianbin Jiao,Yao Hu,Shaosheng Cao,Yunfan Liu

类目:Computation and Language (cs.CL)

关键词:Masked Diffusion Language, Uniform-noise Diffusion Language, Diffusion Language Models, Masked Diffusion, Uniform-noise Diffusion

备注: 32 pages, Code is available at [this https URL](https://github.com/MzeroMiko/XDLM)

点击查看摘要

Abstract:In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) an alleviated memory bottleneck enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM's superior potential for long-term scaling. Code is available at this https URL

120. 【2602.01348】CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

链接https://arxiv.org/abs/2602.01348

作者:Yu Liu,Wenxiao Zhang,Cong Cao,Fangfang Yuan,Weizhuo Chen,Cheng Hu,Pin Xu,Yuling Yang,Kun Peng,Diandian Guo,Qiang Sun,Yanbing Liu,Jin B. Hong,Zhiyuan Ma

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:ground Large Language, Large Language Models, Large Language, ground Large, multi-hop question answering

备注

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning Collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence--distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.

121. 【2602.01331】A-MapReduce: Executing Wide Search via Agentic MapReduce

链接https://arxiv.org/abs/2602.01331

作者:Mingju Chen,Guibin Zhang,Heng Chang,Yuchen Guo,Shiji Zhou

类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL)

关键词:Contemporary large language, large language model, systems exhibit systematic, exhibit systematic advantages, structured information seeking

备注: 33 pages

点击查看摘要

Abstract:Contemporary large language model (LLM)-based multi-agent systems exhibit systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. However, when confronted with wide search tasks characterized by large-scale, breadth-oriented retrieval, existing agentic frameworks, primarily designed around sequential, vertically structured reasoning, remain stuck in expansive search objectives and inefficient long-horizon execution. To bridge this gap, we propose A-MapReduce, a MapReduce paradigm-inspired multi-agent execution framework that recasts wide search as a horizontally structured retrieval problem. Concretely, A-MapReduce implements parallel processing of massive retrieval targets through task-adaptive decomposition and structured result aggregation. Meanwhile, it leverages experiential memory to drive the continual evolution of query-conditioned task allocation and recomposition, enabling progressive improvement in large-scale wide-search regimes. Extensive experiments on five agentic benchmarks demonstrate that A-MapReduce is (i) high-performing, achieving state-of-the-art performance on WideSearch and DeepWideSearch, and delivering 5.11% - 17.50% average Item F1 improvements compared with strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones; (ii) cost-effective and efficient, delivering superior cost-performance trade-offs and reducing running time by 45.8\% compared to representative multi-agent baselines. The code is available at this https URL.

122. 【2602.01326】DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas

链接https://arxiv.org/abs/2602.01326

作者:Zirui Wu,Lin Zheng,Zhihui Xie,Jiacheng Ye,Jiahui Gao,Shansan Gong,Yansong Feng,Zhenguo Li,Wei Bi,Guorui Zhou,Lingpeng Kong

类目:Computation and Language (cs.CL)

关键词:specialized prompting design, Diffusion Language Models, offering flexible, Diffusion Language, present a compelling

备注: ICLR 2026

点击查看摘要

Abstract:Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length. To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable-length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes. Built upon Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length. Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Our code is available at this https URL.

123. 【2602.01322】PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

链接https://arxiv.org/abs/2602.01322

作者:Panagiotis Koromilas,Andreas D. Demou,James Oldfield,Yannis Panagakis,Mihalis Nicolaou

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:interpreting neural network, neural network representations, Sparse autoencoders, sparse combinations, dictionary atoms

备注

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10$\times$ larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency ($r = 0.06$ vs. $r = 0.82$ for SAE feature covariance), suggesting that polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.

124. 【2602.01313】EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language ModelsEverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

链接https://arxiv.org/abs/2602.01313

作者:Chuanrui Hu,Tong Li,Xingze Gao,Hongda Chen,Dannong Xu,Yi Bai,Tianwei Lin,Xinda Zhao,Xiaohong Li,Jiaqi An,Yunyun Han,Jian Pei,Yafeng Deng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:capture real-world complexity, Long-term conversational memory, Long-term conversational, existing benchmarks focus, LLM-based assistants

备注: 10 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.

125. 【2602.01274】PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length

链接https://arxiv.org/abs/2602.01274

作者:Situo Zhang,Yifan Zhang,Zichen Zhu,Hankun Wang,Da Ma,Danyang Zhang,Lu Chen,Kai Yu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, sacrificing accuracy, powerful technique, technique for accelerating, accelerating the inference

备注

点击查看摘要

Abstract:Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x Speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x Speedup.

126. 【2602.01246】PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian

链接https://arxiv.org/abs/2602.01246

作者:Jamshid Mozafari,Seyed Parsa Mousavinasab,Adam Jatowt

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Reasoning-focused Question Answering, languages remain scarce, Large Language Models, Large Language, low-resource languages remain

备注: Submitted to SIGIR 2026

点击查看摘要

Abstract:Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.

127. 【2602.01244】Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

链接https://arxiv.org/abs/2602.01244

作者:Siwei Wu,Yizhi Li,Yuyang Song,Wei Zhang,Yang Wang,Riza Batista-Navarro,Xian Yang,Mingjie Tang,Bryan Dai,Jian Yang,Chenghua Lin

类目:Computation and Language (cs.CL)

关键词:capture realistic long-horizon, realistic long-horizon interactions, Training agentic models, Training agentic, terminal-based tasks critically

备注: Agentic Trajectory, Agentic Model, Terminal, Code Agent

点击查看摘要

Abstract:Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \textbf{\emph{Executability}}, since each instance requires a suitable and often distinct Docker environment; and \textbf{\emph{Verifiability}}, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose \textbf{TerminalTraj}, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20\% on TB~1.0 and 10\% on TB~2.0 over their respective backbones. Notably, \textbf{TerminalTraj-32B} achieves strong performance among models with fewer than 100B parameters, reaching 35.30\% on TB~1.0 and 22.00\% on TB~2.0, and demonstrates improved test-time scaling behavior. All code and data are available at this https URL.

128. 【2602.01240】Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

链接https://arxiv.org/abs/2602.01240

作者:Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang

类目:Computation and Language (cs.CL)

关键词:Zero-shot methods detect, methods detect LLM-generated, detect LLM-generated text, computing statistical signatures, Zero-shot methods

备注

点击查看摘要

Abstract:Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.

129. 【2602.01239】Inferential Question Answering

链接https://arxiv.org/abs/2602.01239

作者:Jamshid Mozafari,Hamed Zamani,Guido Zuccon,Adam Jatowt

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:existing work focuses, extensive research, wide range, existing work, work focuses

备注: Proceedings of the ACM Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment-i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.

130. 【2602.01227】Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority

链接https://arxiv.org/abs/2602.01227

作者:Zhanming Shen,Zeyu Qin,Jiaqi Hu,Wentao Ye,Hao Chen,Xiaomeng Hu,Haokai Xu,Gang Chen,Yi R. Fung,Haobo Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:achieving true human, true human utility, fine-grained autoregressive generation, fitting empirical data, granularity mismatch

备注

点击查看摘要

Abstract:The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for toxic modes unlearning. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.

131. 【2602.01212】SimpleGPT: Improving GPT via A Simple Normalization Strategy

链接https://arxiv.org/abs/2602.01212

作者:Marco Chen,Xianbiao Qi,Yelin He,Jiaquan Ye,Rong Xiao

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:revisit Transformer optimization, maximum tolerable learning, revisit Transformer, tolerable learning rate, Hessian matrix

备注: We propose SimpleGPT, a simple yet effective GPT model, and provide theoretical insights into its mathematical foundations. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B

点击查看摘要

Abstract:In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3$\times$-10$\times$ larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at this https URL.

132. 【2602.01208】Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling

链接https://arxiv.org/abs/2602.01208

作者:Kai Zhang,Jiayi Liao,Chengpeng Li,Ziyuan Xie,Sihang Li,Xiang Wang

类目:Computation and Language (cs.CL)

关键词:Test-Time Scaling, large language models, effective paradigm, paradigm for improving, performance of large

备注

点击查看摘要

Abstract:Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods -- most notably majority voting and heuristic token-level scoring -- treat reasoning traces or tokens equally, thereby being susceptible to substantial variations in trajectory quality and localized logical failures. In this work, we introduce \textbf{Chronos}, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21\% over Pass@1 and 22.70\% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.

133. 【2602.01204】ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

链接https://arxiv.org/abs/2602.01204

作者:Xuqin Zhang,Quan He,Zhenrui Zheng,Zongzhang Zhang,Xu He,Dong Li

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, eliciting long-horizon reasoning, Language Models, Reinforcement learning

备注

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.

134. 【2602.01203】Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

链接https://arxiv.org/abs/2602.01203

作者:Zizhuo Fu,Wenxuan Zeng,Runsheng Wang,Meng Li

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, assign disproportionate attention, attention, Sink Attention

备注

点击查看摘要

Abstract:Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.

135. 【2602.01193】Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation

链接https://arxiv.org/abs/2602.01193

作者:Shashini Nilukshi,Deshan Sumanathilaka

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Word Sense Disambiguation, traditional Word Sense, Visual Word Sense, Word Sense, Sense Disambiguation

备注: 2 figures, 2 Tables, Accepted at IEEE TIC 2026

点击查看摘要

Abstract:This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. It focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8\% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.

136. 【2602.01171】ASP-Bench: From Natural Language to Logic Programs

链接https://arxiv.org/abs/2602.01171

作者:Stefan Szeider

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

关键词:affects neurosymbolic engineering, Automating the translation, Answer Set Programs, neurosymbolic engineering, challenging task

备注

点击查看摘要

Abstract:Automating the translation of natural-language specifications into logic programs is a challenging task that affects neurosymbolic engineering. We present ASP-Bench, a benchmark comprising 128 natural language problem instances, 64 base problems with easy and hard variants. It evaluates systems that translate natural-language problems into Answer Set Programs (ASPs), a prominent form of logic programming. It provides systematic coverage of ASP features, including choice rules, aggregates, and optimization. Each problem includes reference validators that check whether solutions satisfy the problem specification. We characterize problems along seven largely independent reasoning aspects (optimization, temporal reasoning, default logic, resource allocation, recursion, spatial reasoning, and quantitative complexity), providing a multidimensional view of modeling difficulty. We test the benchmark using an agentic approach based on the ReAct (Reason and Act) framework, which achieves full saturation, demonstrating that feedback-driven iterative refinement with solver feedback provides a reliable and robust approach for modeling natural language in ASP. Our analysis across multiple agent runs enables us to gain insights into what determines a problem's modeling hardness.

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Cite as:
arXiv:2602.01171 [cs.AI]

(or
arXiv:2602.01171v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2602.01171

Focus to learn more

              arXiv-issued DOI via DataCite</p>
137. 【2602.01170】EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech

链接https://arxiv.org/abs/2602.01170

作者:Besher Hassan,Ibrahim Alsarraj,Musaab Hasan,Yousef Melhim,Shahem Fadi,Shahem Sultan

类目:Computation and Language (cs.CL)

关键词:affects service quality, banking customer service, context affects service, cross-lingual spoken communication, work presents EmoAra

备注: 10 pages, 3 figures

点击查看摘要

Abstract:This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.

138. 【2602.01169】PedagoSense: A Pedology Grounded LLM System for Pedagogical Strategy Detection and Contextual Response Generation in Learning Dialogues

链接https://arxiv.org/abs/2602.01169

作者:Shahem Sultan,Shahem Fadi,Yousef Melhim,Ibrahim Alsarraj,Besher Hassan

类目:Computation and Language (cs.CL)

关键词:improving interaction quality, recommending effective pedagogical, effective pedagogical strategies, paper addresses, addresses the challenge

备注: 8 pages, 5 figures

点击查看摘要

Abstract:This paper addresses the challenge of improving interaction quality in dialogue based learning by detecting and recommending effective pedagogical strategies in tutor student conversations. We introduce PedagoSense, a pedology grounded system that combines a two stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human annotated tutor student dialogues, augmented with additional non pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM based response generation for more adaptive educational technologies.

139. 【2602.01162】ypologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages

链接https://arxiv.org/abs/2602.01162

作者:Nipuna Abeykoon,Ashen Weerathunga,Pubudu Wijesinghe,Parameswari Krishnamurthy

类目:Computation and Language (cs.CL)

关键词:exhibit systematic biases, typologically divergent low-resource, Large language models, models trained predominantly, dominant typological patterns

备注

点击查看摘要

Abstract:Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences each having different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.

140. 【2602.01161】Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

链接https://arxiv.org/abs/2602.01161

作者:Reem I. Masoud,Chen Feng,Shunta Asano,Saied Alshahrani,Philip Colin Treleaven,Miguel R. D. Rodrigues

类目:Computation and Language (cs.CL)

关键词:remain poorly understood, adaptation remain poorly, cultural adaptation remain, large language models, poorly understood

备注

点击查看摘要

Abstract:The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.

141. 【2602.01132】Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation

链接https://arxiv.org/abs/2602.01132

作者:Abhilekh Borah,Shubhra Ghosh,Kedar Joshi,Aditya Kumar Guru,Kripabandhu Ghosh

类目:Computation and Language (cs.CL)

关键词:solving arithmetic equations, Obfus Blood Relation, Obfus Number Series, evaluating truth tables, Obfus Direction Sense

备注: 19 pages, 6 figures

点击查看摘要

Abstract:Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for reasoning model, o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.

142. 【2602.01125】Long-range Modeling and Processing of Multimodal Event Sequences

链接https://arxiv.org/abs/2602.01125

作者:Jichu Li,Yilun Zhong,Zhiting Li,Feng Zhou,Quyu Kong

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Temporal point processes, modeling asynchronous event, point processes, emerged as powerful, powerful tools

备注

点击查看摘要

Abstract:Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.

143. 【2602.01120】MarkovScale: Towards Optimal Sequential Scaling at Inference Time

链接https://arxiv.org/abs/2602.01120

作者:Youkang Wang,Jian Wang,Rubing Chen,Tianyi Zeng,Xiao-Yong Wei,Qing Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:inference-time scaling paradigm, prominent inference-time scaling, obscure clear optimality, Sequential scaling, clear optimality bounds

备注: 12 pages

点击查看摘要

Abstract:Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely due to the prevalence of heuristic, non-principled approaches that obscure clear optimality bounds. To address this, we propose a principled framework that models sequential scaling as a two-state Markov process. This approach reveals the underlying properties of sequential scaling and yields closed-form solutions for essential aspects, such as the specific conditions under which accuracy is improved and the theoretical upper, neutral, and lower performance bounds. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs, 5 benchmarks, and over 20 configurations show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be open upon acceptance at https://open-upon-acceptance.

144. 【2602.01119】ndem: A Hybrid AI+Human Platform

链接https://arxiv.org/abs/2602.01119

作者:Konstantin Chernyshev,Ekaterina Artemova,Viacheslav Zhukov,Maksim Nerush,Mariia Fedorova,Iryna Repik,Olga Shapovalova,Aleksey Sukhorosov,Vladimir Dobrovolskii,Natalia Mikhailova,Sergei Tilga

类目:Computation and Language (cs.CL)

关键词:Human Experts step, Experts step, handles structured, repeatable work, Human Experts

备注

点击查看摘要

Abstract:Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem's performance, we conducted a series of in-house evaluations on 94 real-world tasks, comparing it with AI-only agents and human-only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher-quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human-only execution. On third-party agentic benchmarks, Tendem's AI Agent (operating autonomously, without human involvement) performs near state-of-the-art on web browsing and tool-use tasks while demonstrating strong results in frontier domain knowledge and reasoning.

145. 【2602.01116】Logic-Oriented Retriever Enhancement via Contrastive Learning

链接https://arxiv.org/abs/2602.01116

作者:Wenxuan Zhang,Yuan-Hao Jiang,Changyong Qi,Rui Jia,Yonghe Wu

类目:Computation and Language (cs.CL)

关键词:Large language models, queries involving complex, Large language, complex logical relations, involving complex logical

备注: accepted by icassp 2026

点击查看摘要

Abstract:Large language models (LLMs) struggle in knowledge-intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine-grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external upervision, resources, or pre-retrieval analysis, remains index-compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at this https URL.

146. 【2602.01070】What If We Allocate Test-Time Compute Adaptively?

链接https://arxiv.org/abs/2602.01070

作者:Ahsan Bilal,Ahmed Mohsin,Muhammad Umer,Ali Subhan,Hassan Rizwan,Ayesha Mohsin,Dean Hougen

类目:Computation and Language (cs.CL)

关键词:fixed sampling strategies, sampling strategies, inference computation uniformly, fixed sampling, applies verification

备注

点击查看摘要

Abstract:Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.

147. 【2602.01068】From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization

链接https://arxiv.org/abs/2602.01068

作者:Chaoqun Cui,Shijing Wang,Liangbin Huang,Qingqing Gu,Zhaolong Huang,Xiao Zeng,Wenji Mao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, development of Large, Language Models, rapid development

备注: Accepted to ICLR 2026

点击查看摘要

Abstract:The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios become more complex, the limitations of LLMs in vertical domain translations are gradually becoming apparent. In this study, we focus on how to construct translation LLMs that meet the needs of domain customization. We take visual media subtitle translation as our topic and explore how to train expressive and vivid translation LLMs. We investigated the situations of subtitle translation and other domains of literal and liberal translation, verifying the reliability of LLM as reward model and evaluator for translation. Additionally, to train an expressive translation LLM, we constructed and released a multidirectional subtitle parallel corpus dataset and proposed the Adaptive Local Preference Optimization (ALPO) method to address fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.

148. 【2602.01064】Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs

链接https://arxiv.org/abs/2602.01064

作者:Ruihan Jin,Pengpeng Shao,Zhengqi Wen,Jinyang Wu,Mingkuan Feng,Shuo Yang,Chu Yuan Zhang,Jianhua Tao

类目:Computation and Language (cs.CL)

关键词:stronger large language, large language models, stronger large, large language, Knowledge

备注

点击查看摘要

Abstract:Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of \textbf{Knowledge Purification}, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.

149. 【2602.01063】Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents

链接https://arxiv.org/abs/2602.01063

作者:Bin Han,Deuksin Kwon,Jonathan Gratch

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Language Models, explicit personality prompts, conditioned with explicit

备注

点击查看摘要

Abstract:Large Language Models (LLMs) can be conditioned with explicit personality prompts, yet their behavioral realization often varies depending on context. This study examines how identical personality prompts lead to distinct linguistic, behavioral, and emotional outcomes across four conversational settings: ice-breaking, negotiation, group decision, and empathy tasks. Results show that contextual cues systematically influence both personality expression and emotional tone, suggesting that the same traits are expressed differently depending on social and affective demands. This raises an important question for LLM-based dialogue agents: whether such variations reflect inconsistency or context-sensitive adaptation akin to human behavior. Viewed through the lens of Whole Trait Theory, these findings highlight that LLMs exhibit context-sensitive rather than fixed personality expression, adapting flexibly to social interaction goals and affective conditions.

150. 【2602.01058】Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

链接https://arxiv.org/abs/2602.01058

作者:Dylan Zhang,Yufeng Xu,Haojin Wang,Qingzhi Chen,Hao Peng

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:offline SFT stage, SFT, process that typically, typically consists, online reinforcement learning

备注

点击查看摘要

Abstract:Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass at 8 gains up to a 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2602.01058 [cs.LG]

(or
arXiv:2602.01058v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2602.01058

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Dylan Zhang [view email] [v1]
Sun, 1 Feb 2026 06:53:45 UTC (303 KB)

151. 【2602.01034】Discovering Process-Outcome Credit in Multi-Step LLM Reasoning

链接https://arxiv.org/abs/2602.01034

作者:Xiangwei Wang,Wei Wang,Ken Chen,Nanduni Nimalsiri,Saman Halgamuge

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Reinforcement Learning, standard outcome-based approaches, Large Language, inefficient credit assignment

备注

点击查看摘要

Abstract:Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals, which introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of reasoning steps against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy, applying process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments across textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.

152. 【2602.01031】HalluHard: A Hard Multi-Turn Hallucination Benchmark

链接https://arxiv.org/abs/2602.01031

作者:Dongyang Fan,Sebastien Delsad,Nicolas Flammarion,Maksym Andriushchenko

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, ungrounded factual claims, early errors cascade, produce plausible-sounding

备注

点击查看摘要

Abstract:Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce $\textbf{HalluHard}$, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ($\approx 30\%$ for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.

153. 【2602.01030】Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations

链接https://arxiv.org/abs/2602.01030

作者:Sheng-Lun Wei,Yu-Ling Liao,Yen-Hua Chang,Hen-Hsen Huang,Hsin-Hsi Chen

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Global MMLU Lite, work presents, systematic investigation, bias in multilingual, MMLU Lite

备注: Accepted as a long findings paper at EACL 2026

点击查看摘要

Abstract:This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours ($\approx$4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss' $\kappa$), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at this https URL.

154. 【2602.01015】Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

链接https://arxiv.org/abs/2602.01015

作者:Conrad Borchers,Jill-Jênn Vie,Roger Azevedo

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Large language models, Large language, increasingly embedded, embedded in AI-based, language models

备注: Manuscript under review

点击查看摘要

Abstract:Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.

155. 【2602.01007】Distilling Token-Trained Models into Byte-Level Models

链接https://arxiv.org/abs/2602.01007

作者:Zishuo Bao,Jiaqi Leng,Junxiong Wang,Bowen Peng,Yucheng Lu

类目:Computation and Language (cs.CL)

关键词:scaling language models, scaling language, Language Models, Byte Language Models, promising direction

备注: 17 pages, 3 figures, 13 tables

点击查看摘要

Abstract:Byte Language Models (BLMs) have emerged as a promising direction for scaling language models beyond tokenization. However, existing BLMs typically require training from scratch on trillions of bytes, making them prohibitively expensive. In this paper, we propose an efficient distillation recipe that converts existing token-trained LLMs into BLMs while retaining comparable capabilities. Our recipe follows a two-stage curriculum: (1) Progressive Knowledge Distillation, which aligns byte-level representations with the embeddings of the token-trained teacher model; and (2) Byte-Level Supervised Fine-Tuning, which enables end-to-end generation entirely in the byte space. We validate our approach across multiple model families, including Llama, Qwen, and OLMo, and demonstrate that the distilled BLMs retain most of the teacher models' performance using only approximately 125B bytes.

156. 【2602.00998】Reliable Use of Lemmas via Eligibility Reasoning and Section$-$Aware Reinforcement Learning

链接https://arxiv.org/abs/2602.00998

作者:Zhikun Xu,Xiaodong Yu,Ben Zhou,Jiang Liu,Jialian Wu,Ze Wang,Ximeng Sun,Hao Chen,Zicheng Liu

类目:Computation and Language (cs.CL)

关键词:Recent large language, Recent large, large language models, perform strongly, validating assumptions

备注

点击查看摘要

Abstract:Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma$-$judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion$-$utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two$-$section output and trains with reinforcement learning plus section$-$aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held$-$out perturbation suite; and end$-$to$-$end evaluation spans competition$-$style, perturbation$-$aligned, and theorem$-$based problems across various LLMs. Results show consistent in$-$domain gains over both a vanilla model and a single$-$label RL baseline, larger improvements on applicability$-$breaking perturbations, and parity or modest gains on end$-$to$-$end tasks; ablations indicate that the two$-$section outputs and section$-$aware reinforcement are both necessary for robustness.

157. 【2602.00997】Error Taxonomy-Guided Prompt Optimization

链接https://arxiv.org/abs/2602.00997

作者:Mayank Singh,Vikas Yadav,Eduardo Blanco

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Automatic Prompt Optimization, Automatic Prompt, modifying their weights, Prompt Optimization, extracting performance

备注

点击查看摘要

Abstract:Automatic Prompt Optimization (APO) is a powerful approach for extracting performance from large language models without modifying their weights. Many existing methods rely on trial-and-error, testing different prompts or in-context examples until a good configuration emerges, often consuming substantial compute. Recently, natural language feedback derived from execution logs has shown promise as a way to identify how prompts can be improved. However, most prior approaches operate in a bottom-up manner, iteratively adjusting the prompt based on feedback from individual problems, which can cause them to lose the global perspective. In this work, we propose Error Taxonomy-Guided Prompt Optimization (ETGPO), a prompt optimization algorithm that adopts a top-down approach. ETGPO focuses on the global failure landscape by collecting model errors, categorizing them into a taxonomy, and augmenting the prompt with guidance targeting the most frequent failure modes. Across multiple benchmarks spanning mathematics, question answering, and logical reasoning, ETGPO achieves accuracy that is comparable to or better than state-of-the-art methods, while requiring roughly one third of the optimization-phase token usage and evaluation budget.

158. 【2602.00996】DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

链接https://arxiv.org/abs/2602.00996

作者:Abhijit Chakraborty,Ashish Raj Shekhar,Shiven Agarwal,Vivek Gupta

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:diverse information sources, images requires integrating, requires integrating diverse, integrating diverse information, Complex question answering

备注

点击查看摘要

Abstract:Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi-agent framework for multimodal question answering. It uses specialized agents: Table, Context, Visual, Summarizing and Verification, that communicate through a shared natural-language log as persistent memory. This log-based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG, provides a scalable approach through modular components using natural-language communication.

159. 【2602.00986】Sparse Reward Subsystem in Large Language Models

链接https://arxiv.org/abs/2602.00986

作者:Guowei Xu,Mert Yuksekgonul,James Zou

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, sparse reward subsystem, biological reward subsystem

备注

点击查看摘要

Abstract:In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain. We demonstrate that this subsystem contains value neurons that represent the model's internal expectation of state value, and through intervention experiments, we establish the importance of these neurons for reasoning. Our experiments reveal that these value neurons are robust across diverse datasets, model scales, and architectures; furthermore, they exhibit significant transferability across different datasets and models fine-tuned from the same base model. By examining cases where value predictions and actual rewards diverge, we identify dopamine neurons within the reward subsystem which encode reward prediction errors (RPE). These neurons exhibit high activation when the reward is higher than expected and low activation when the reward is lower than expected.

160. 【2602.00983】DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

链接https://arxiv.org/abs/2602.00983

作者:Batuhan K. Karaman,Aditya Rawal,Suhaila Shakiah,Mohammad Ghavamzadeh,Mingyi Hong,Arijit Biswas,Ruida Zhou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Reinforcement learning, large language models, verifiable rewards, rewards has emerged, promising paradigm

备注: This work is accepted to the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights 1 increase the average token entropy (i.e., exploration) while weights 1 decrease it (i.e., distillation) -- both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights 1) or vanishing response lengths (when weights 1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.

161. 【2602.00981】MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA

链接https://arxiv.org/abs/2602.00981

作者:Yutong Song,Shiva Shrestha,Chenhan Lyu,Elahe Khatibi,Pengfei Zhang,Honghui Xu,Nikil Dutt,Amir Rahmani

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Spoken question-answering, recognizing medical terminology, automatic speech recognition, accurately recognizing medical, systems relying

备注

点击查看摘要

Abstract:Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at this https URL.

162. 【2602.00979】GradingAttack: Attacking Large Language Models Towards Short Answer Grading Ability

链接https://arxiv.org/abs/2602.00979

作者:Xueyi Li,Zhuoneng Zhou,Zitao Liu,Yongdong Wu,Weiqi Luo

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, significantly boosting student, demonstrated remarkable potential, boosting student assessment, student assessment efficiency

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable potential for automatic short answer grading (ASAG), significantly boosting student assessment efficiency and scalability in educational scenarios. However, their vulnerability to adversarial manipulation raises critical concerns about automatic grading fairness and reliability. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the vulnerability of LLM based ASAG models. Specifically, we align general-purpose attack methods with the specific objectives of ASAG by designing token-level and prompt-level strategies that manipulate grading outcomes while maintaining high camouflage. Furthermore, to quantify attack camouflage, we propose a novel evaluation metric that balances attack success and camouflage. Experiments on multiple datasets demonstrate that both attack strategies effectively mislead grading models, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior camouflage capability. Our findings underscore the need for robust defenses to ensure fairness and reliability in ASAG. Our code and datasets are available at this https URL.

163. 【2602.00977】rust in One Round: Confidence Estimation for Large Language Models via Structural Signals

链接https://arxiv.org/abs/2602.00977

作者:Pengyue Yang,Jiawen Wen,Haolin Jin,Linghan Huang,Huaming Chen,Ling Chen

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:carry high social, Large language models, errors carry high, Large language, high social

备注: Accepted at The ACM Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. Yet standard confidence estimators, such as token likelihood, semantic similarity and multi-sample consistency, remain brittle under distribution shift, domain-specialised text, and compute limits. In this work, we present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction based on multi-scale structural signals derived from a model's final-layer hidden-state trajectory. By combining spectral, local-variation, and global shape descriptors, our method captures internal stability patterns that are missed by probabilities and sentence embeddings. We conduct extensive, cross-domain evaluation across four heterogeneous benchmarks-FEVER (fact verification), SciFact (scientific claims), WikiBio-hallucination (biographical consistency), and TruthfulQA (truthfulness-oriented QA). Our Structural Confidence framework demonstrates strong performance compared with established baselines in terms of AUROC and AUPR. More importantly, unlike sampling-based consistency methods which require multiple stochastic generations and an auxiliary model, our approach uses a single deterministic forward pass, offering a practical basis for efficient, robust post-hoc confidence estimation in socially impactful, resource-constrained LLM applications.

164. 【2602.00970】Verification Required: The Impact of Information Credibility on AI Persuasion

链接https://arxiv.org/abs/2602.00970

作者:Saaduddin Mahmud,Eugene Bagdasarian,Shlomo Zilberstein

类目:Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)

关键词:shapes high-stakes decisions, large language models, strategic communication essential, high-stakes decisions, making a principled

备注: 19 pages, 5 figures

点击查看摘要

Abstract:Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high-stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap-talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM-to-LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state-of-the-art LLM agents in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in-context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.

165. 【2602.00959】Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

链接https://arxiv.org/abs/2602.00959

作者:Yuheng Yang,Siqi Zhu,Tao Feng,Ge Liu,Jiaxuan You

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, compressed knowledge bases, knowledge boundaries extend, Language Models

备注: Homepage: [this https URL](https://ulab-uiuc.github.io/KnowledgeExtraction/)

点击查看摘要

Abstract:Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundaries extend. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularities. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove exact duplicates, LLM-based adjudication to resolve ambiguous semantic overlaps, and domain-relevance auditing to retain valid knowledge units. Through extensive experiments, we find that recursive taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently extract more knowledge. In addition, we identify a Pass@1-versus-Pass@k trade-off: domain-specialized models achieve higher initial accuracy but degrade rapidly, while general-purpose models maintain stable performance during extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families.

166. 【2602.00945】Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

链接https://arxiv.org/abs/2602.00945

作者:Anusa Saha,Tanmay Joshi,Vinija Jain,Aman Chadha,Amitava Das

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:reflecting English language, LLMs are multilingual, multilingual by training, dominance in pretraining, English language dominance

备注

点击查看摘要

Abstract: LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, that makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English to target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.00945 [cs.CL]

(or
arXiv:2602.00945v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.00945

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
167. 【2602.00914】A Baseline Multimodal Approach to Emotion Recognition in Conversations

链接https://arxiv.org/abs/2602.00914

作者:Víctor Yeste,Rodrigo Rivas-Arévalo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:sitcom Friends, dataset built, lightweight multimodal baseline, present a lightweight, emotion recognition

备注: 10 pages

点击查看摘要

Abstract:We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.

168. 【2602.00913】Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts

链接https://arxiv.org/abs/2602.00913

作者:Víctor Yeste,Paolo Rosso

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Schwartz higher-order, Schwartz, classification over Schwartz, unclear whether Schwartz, typically framed

备注: Code: [this https URL](https://github.com/VictorMYeste/human-value-detection) , 42 pages, 4 figures

点击查看摘要

Abstract:Sentence-level human value detection is typically framed as multi-label classification over Schwartz values, but it remains unclear whether Schwartz higher-order (HO) categories provide usable structure. We study this under a strict compute-frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO$\rightarrow$values pipelines that enforce the hierarchy with hard masks, and (iii) Presence$\rightarrow$HO$\rightarrow$values cascades, alongside low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small instruction-tuned LLM baselines ($\le$10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro-$F_1\approx0.58$), but hard hierarchical gating is not a reliable win: it often reduces end-task Macro-$F_1$ via error compounding and recall suppression. In contrast, label-wise threshold tuning is a high-leverage knob (up to $+0.05$ Macro-$F_1$), and small transformer ensembles provide the most consistent additional gains (up to $+0.02$ Macro-$F_1$). Small LLMs lag behind supervised encoders as stand-alone systems, yet can contribute complementary errors in cross-family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration and lightweight ensembling.

169. 【2602.00906】Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

链接https://arxiv.org/abs/2602.00906

作者:Anxin Guo,Jingwei Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT)

关键词:Large language models, lack inferable patterns, Large language, inferable patterns, language models

备注

点击查看摘要

Abstract:Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified "closed world" setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.

170. 【2602.00887】EffGen: Enabling Small Language Models as Capable Autonomous Agents

链接https://arxiv.org/abs/2602.00887

作者:Gaurav Srivastava,Aafiya Hussain,Chi Wang,Yingyan Celine Lin,Xuan Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:API calls, existing language model, agentic systems today, today are built, language model agentic

备注

点击查看摘要

Abstract:Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (this https URL) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at this https URL.

171. 【2602.00881】ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople

链接https://arxiv.org/abs/2602.00881

作者:Shounak Paul,Raghav Dogra,Pawan Goyal,Saptarshi Ghosh

类目:Computation and Language (cs.CL)

关键词:Legal Statute Identification, Legal NLP, Statute Identification, Legal, Legal Statute

备注: 9 Pages of Main, 1 page of Limitations and Ethics Statement, 11 Pages of Appendix, Accepted for Publication at EACL 2026 (Findings)

点击查看摘要

Abstract:Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research to explore the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. Additionally, the corpus also contains court case judgements to enable researchers to effectively compare between court and laypeople data for LSI. We conducted extensive experiments on our corpus, including benchmarking over the laypeople dataset using zero and few-shot inference, retrieval-augmented generation and supervised fine-tuning. We observe that models trained purely on court judgements are ineffective during test on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.

172. 【2602.00871】Beyond Output Critique: Self-Correction via Task Distillation

链接https://arxiv.org/abs/2602.00871

作者:Hossein A. Rahmani,Mengting Wan,Pei Zhou,Longqi Yang,Nick Craswell,Emine Yilmaz,Sujay Kumar Jauhar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:promising self-correction abilities, shown promising self-correction, shown promising, self-correction abilities, iterative refinement improves

备注

点击查看摘要

Abstract:Large language models (LLMs) have shown promising self-correction abilities, where iterative refinement improves the quality of generated responses. However, most existing approaches operate at the level of output critique, patching surface errors while often failing to correct deeper reasoning flaws. We propose SELF-THOUGHT, a framework that introduces an intermediate step of task abstraction before solution refinement. Given an input and an initial response, the model first distills the task into a structured template that captures key variables, constraints, and problem structure. This abstraction then guides solution instantiation, grounding subsequent responses in a clearer understanding of the task and reducing error propagation. Crucially, we show that these abstractions can be transferred across models: templates generated by larger models can serve as structured guides for smaller LLMs, which typically struggle with intrinsic self-correction. By reusing distilled task structures, smaller models achieve more reliable refinements without heavy fine-tuning or reliance on external verifiers. Experiments across diverse reasoning tasks demonstrate that SELF-THOUGHT improves accuracy, robustness, and generalization for both large and small models, offering a scalable path toward more reliable self-correcting language systems.

173. 【2602.00866】Foundation CAN LM: A Pretrained Language Model For Automotive CAN Data

链接https://arxiv.org/abs/2602.00866

作者:Akiharu Esashi,Pawissanutt Lertpongrujikorn,Justin Makino,Yuibi Fujimoto,Mohsen Amini Salehi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Controller Area Network, Area Network, including collision detection, Controller Area, vehicular signals increasingly

备注

点击查看摘要

Abstract:The Controller Area Network (CAN) bus provides a rich source of vehicular signals increasingly leveraged for applications in automotive and auto insurance domains, including collision detection, predictive maintenance, and driver risk modeling. Despite this potential, existing pipelines largely train isolated task-specific models on raw CAN data, with only limited efforts exploring decoded signals. Such fragmentation prevents shared representation learning and limits cross-task generalization. By contrast, natural language processing (NLP) and computer vision (CV) have been transformed by the foundation model paradigm: large-scale pretraining followed by task-specific adaptation. In this work, we introduce the foundation CAN model that demonstrates multi-objective downstream generalization using a single pretrained backbone. Our approach treats CAN data as a language: we pretrain on large-scale, unlabeled decoded CAN signals and fine-tune across heterogeneous auto insurance tasks. To enable this, we propose a unified tokenization scheme for mixed discrete-continuous signals and address challenges of temporal complexity and trip-specific variability. Our results show that one pretrained CAN model can adapt effectively to diverse predictive tasks, validating that the foundation modeling paradigm, proven in NLP and CV, also holds for CAN data. This establishes a new direction for generalizable representation learning in automotive AI.

174. 【2602.00861】Multi-Head Attention Is a Multi-Player Game

链接https://arxiv.org/abs/2602.00861

作者:Kushal Chakrabarti,Nirmal Balachundar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

关键词:Modern transformer attention, Modern transformer, internally multi-agent, compete and coordinate, monolithic optimizer

备注

点击查看摘要

Abstract:Modern transformer attention is internally multi-agent -- heads compete and coordinate -- yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by $\Gamma(G)$, the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both \emph{excess hallucination probability} and \emph{excess head redundancy} scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces $\Gamma(G)$ provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure. Experiments validate the theory: $\Gamma(G)$ predicts hallucination ($p{}0.05$), emergent coalitions exhibit selective coordination, and GAME-LoRA achieves up to 18\% hallucination reduction (8\% average) with no knowledge degradation -- a Pareto improvement inaccessible to methods ignoring the game structure.

175. 【2602.00857】Unifying Adversarial Robustness and Training Across Text Scoring Models

链接https://arxiv.org/abs/2602.00857

作者:Manveer Singh Tamber,Hosna Oyarhoseini,Jimmy Lin

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:obscuring shared vulnerabilities, obscuring shared, shared vulnerabilities, fragmented across applications, text scoring

备注

点击查看摘要

Abstract:Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.

176. 【2602.00848】Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation

链接https://arxiv.org/abs/2602.00848

作者:Ziwei Gong,Yanda Chen,Julia Hirschberg,Chen Zhao,He He,Zhou Yu,Kathleen Mckeown

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, encode knowledge, degrees of confidence, knowledge with varying

备注

点击查看摘要

Abstract:Large language models (LLMs) encode knowledge with varying degrees of confidence. When responding to queries, models face an inherent trade-off: they can generate responses that are less informative but highly factual, or more informative but potentially less accurate. Different applications demand different balances between informativeness and factuality. We introduce Factuality-Controlled Generation (FCG), a framework that enables users to specify factuality constraints alongside their queries. We propose to evaluate FCG performance on two dimensions: adherence to factuality constraints and response informativeness. We propose to train models on the FCG task using synthetic data, and show that our synthetic training significantly improves models' ability to both respect factuality requirements and maintain informativeness in their outputs.

177. 【2602.00846】Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

链接https://arxiv.org/abs/2602.00846

作者:Zicheng Kong,Dehua Ma,Zhenbo Xu,Alven Yang,Yiwei Ru,Haoran Wang,Zixuan Zhou,Fuqing Bie,Liuyu Xiang,Huijia Wu,Jian Zhao,Zhaofeng He

类目:Computation and Language (cs.CL)

关键词:Multimodal large language, Multimodal large, existing alignment techniques, shown remarkable capabilities, large language models

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce \textbf{Omni-RRM}, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across \textbf{text, image, video, and audio}. At the core of our approach is \textbf{Omni-Preference}, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to \emph{reconcile and filter} preferences while providing a modality-aware \emph{rubric-grounded rationale} for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2\% on ShareGPT-V) and audio (66.8\% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7\% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-$N$ selection and transfers to text-only preference benchmarks. Our data, code, and models are available at this https URL.

178. 【2602.00793】SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality

链接https://arxiv.org/abs/2602.00793

作者:Yoonsang Kim,Devshree Jadeja,Divyansh Pradhan,Yalong Yang,Arie Kaufman

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)

关键词:day creates unnecessary, Speaking aloud, creates unnecessary effort, wearable AR assistant, requests every day

备注: 11 pages, 9 figures. This is the author's version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR) 2026

点击查看摘要

Abstract:Speaking aloud to a wearable AR assistant in public can be socially awkward, and re-articulating the same requests every day creates unnecessary effort. We present SpeechLess, a wearable AR assistant that introduces a speech-based intent granularity control paradigm grounded in personalized spatial memory. SpeechLess helps users "speak less," while still obtaining the information they need, and supports gradual explicitation of intent when more complex expression is required. SpeechLess binds prior interactions to multimodal personal context-space, time, activity, and referents-to form spatial memories, and leverages them to extrapolate missing intent dimensions from under-specified user queries. This enables users to dynamically adjust how explicitly they express their informational needs, from full-utterance to micro/zero-utterance interaction. We motivate our design through a week-long formative study using a commercial smart glasses platform, revealing discomfort with public voice use, frustration with repetitive speech, and hardware constraints. Building on these insights, we design SpeechLess, and evaluate it through controlled lab and in-the-wild studies. Our results indicate that regulated speech-based interaction, can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.

179. 【2602.00777】HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference

链接https://arxiv.org/abs/2602.00777

作者:Xuan Ai,Qingqing Yang,Peng Wang,Lei Deng,Lin Zhang,Renhai Chen,Gong Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, substantial memory footprint, footprint of Key-Value

备注

点击查看摘要

Abstract:Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce {\bf HyLRA} ({\bf Hy}brid {\bf L}ayer {\bf R}euse {\bf A}ttention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: \textit{intra-layer sensitivity}, where specific layers necessitate full attention to prevent feature distortion, and \textit{inter-layer similarity}, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer-wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top-$k$ indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6\%--46\% while maintaining comparable performance (with $1\%$ accuracy degradation), consistently outperforming state-of-the-art sparse attention methods. HyLRA is open source at \href{this https URL}{\texttt{/r/unified-cache-management-CF80/}}

180. 【2602.00770】Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models

链接https://arxiv.org/abs/2602.00770

作者:Siyuan Zhang,Jialian Li,Yichi Zhang,Xiao Yang,Yinpeng Dong,Hang Su

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, achieved remarkable performance, Language Models, motivating research

备注: 30 pages, 27 figures, 8 tables

点击查看摘要

Abstract:Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model's internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations; while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the dominant driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.

181. 【2602.00769】Eliciting Trustworthiness Priors of Large Language Models via Economic Games

链接https://arxiv.org/abs/2602.00769

作者:Siyu Yan,Lusha Zhu,Jian-Qiao Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:trustworthy artificial intelligence, maintaining calibrated trust, automation bias, Trust Game, trustworthy artificial

备注

点击查看摘要

Abstract:One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1's trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.

182. 【2602.00762】WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs

链接https://arxiv.org/abs/2602.00762

作者:Yuheng Shao,Junjie Xiong,Chaoran Wu,Xiyuan Wang,Ziyu Zhou,Yang Ouyang,Qinyi Tao,Quan Li

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:remains a significant, significant challenge, English learners, Large Language Models, keyword method

备注: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI' 26), April 13--17, 2026, Barcelona, Spain

点击查看摘要

Abstract:Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.

183. 【2602.00760】APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards

链接https://arxiv.org/abs/2602.00760

作者:Kaiyan Chang,Chenwei Zhu,Yingfeng Luo,Yifu Huo,Chenglong Wang,Xiaoqian Liu,Qiaozhi He,Tong Xiao,Zhengtao Yu,Jingbo Zhu

类目:Computation and Language (cs.CL)

关键词:Large Reasoning Models, Test-Time Scaling, capabilities of Large, Large Reasoning, enhanced the capabilities

备注: Under Review

点击查看摘要

Abstract:Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define this specific position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover the structural redundancy fixed in LRMs: the meaningless repetitive verification after deriving the first complete answer, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST. Leveraging the policy optimization algorithm suitable for length penalties, our APR models achieved the performance-efficiency Pareto frontier at 1.5B and 7B scales averaged across five mathematical reasoning datasets while requiring significantly fewer computational resources for RL training.

184. 【2602.00759】Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

链接https://arxiv.org/abs/2602.00759

作者:Zhipeng Chen,Xiaobo Qin,Wayne Xin Zhao,Youbin Wu,Ji-Rong Wen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:shown great potential, Reinforcement learning, large language models, Adaptive Ability Decomposing, RLVR process

备注: 21 pages, Working in progress

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.

185. 【2602.00758】mporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting

链接https://arxiv.org/abs/2602.00758

作者:Ali El Lahib,Ying-Jieh Xia,Zehan Li,Yuxuan Wang,Xinyu Pi

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Search-engine date filters, Search-engine date, enforce pre-cutoff retrieval, search-augmented forecasters, enforce pre-cutoff

备注: 9 pages, 6 figures

点击查看摘要

Abstract:Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show this approach is unreliable: auditing Google Search with a before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and for 41%, at least one page directly reveals the answer. Using a large language model (LLM), gpt-oss-120b, to forecast with these leaky documents, we demonstrate an inflated prediction accuracy (Brier score 0.108 vs. 0.242 with leak-free documents). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata/timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.

186. 【2602.00747】Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

链接https://arxiv.org/abs/2602.00747

作者:Shengrui Li,Fei Zhao,Kaiyan Zhao,Jieying Ye,Haifeng Liu,Fangcheng Shi,Zheyong Xie,Yao Hu,Shaosheng Cao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Model, Large Language, balance general competence, Determining an effective, factor in Large

备注: 17 pages, 5 figures

点击查看摘要

Abstract:Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at this https URL.

187. 【2602.00742】CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs

链接https://arxiv.org/abs/2602.00742

作者:Liang Wang,Xinyi Mou,Xiaoyou Liu,Xuanjing Huang,Zhongyu Wei

类目:Computation and Language (cs.CL)

关键词:Large Language Models, modeling characterizes individuals, Large Language, User modeling characterizes, enable personalized simulation

备注

点击查看摘要

Abstract:User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs) in contemporary approaches. However, existing methods, whether prompt-based or training-based methods, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2\% of the total model size). Through extensive experiments on variant generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code are available at this https URL

188. 【2602.00740】ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement

链接https://arxiv.org/abs/2602.00740

作者:Ziyan Xiao,Yinghao Zhu,Liang Peng,Lequan Yu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:remains difficult due, limited high-quality data, Clinical text improvement, medical documentation, Large Language Models

备注

点击查看摘要

Abstract:Clinical text improvement is vital for healthcare efficiency but remains difficult due to limited high-quality data and the complex constraints of medical documentation. While Large Language Models (LLMs) show promise, current approaches struggle in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides superficial corrections without capturing the reasoning behind revisions. To address these limitations, we propose ExperienceWeaver, a hierarchical framework that shifts the focus from data retrieval to experience learning. Instead of simply recalling past examples, ExperienceWeaver distills noisy, multi-dimensional feedback into structured, actionable knowledge. Specifically, error-specific Tips and high-level Strategies. By injecting this distilled experience into an agentic pipeline, the model learns "how to revise" rather than just "what to revise". Extensive evaluations across four clinical datasets demonstrate that ExperienceWeaver consistently improves performance, surpassing state-of-the-art models such as Gemini-3 Pro in small-sample settings.

189. 【2602.00733】EchoReview: Learning Peer Review from the Echoes of Scientific Citations

链接https://arxiv.org/abs/2602.00733

作者:Yinuo Zhang,Dingcheng Huang,Haifeng Suo,Yizhuo Li,Ziya Zhao,Junhao Xu,Zhiying Tu,Dianhui Chu,Deming Zhai,Xianming Liu,Xiaoyan Yu,Dianbo Sui

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:unprecedented scalability pressures, facing unprecedented scalability, scientific submissions continues, automated reviewing methods, grow rapidly

备注

点击查看摘要

Abstract:As the volume of scientific submissions continues to grow rapidly, traditional peer review systems are facing unprecedented scalability pressures, highlighting the urgent need for automated reviewing methods that are both scalable and reliable. Existing supervised fine-tuning approaches based on real review data are fundamentally constrained by single-source of data as well as the inherent subjectivity and inconsistency of human reviews, limiting their ability to support high-quality automated reviewers. To address these issues, we propose EchoReview, a citation-context-driven data synthesis framework that systematically mines implicit collective evaluative signals from academic citations and transforms scientific community's long-term judgments into structured review-style data. Based on this pipeline, we construct EchoReview-16K, the first large-scale, cross-conference, and cross-year citation-driven review dataset, and train an automated reviewer, EchoReviewer-7B. Experimental results demonstrate that EchoReviewer-7B can achieve significant and stable improvements on core review dimensions such as evidence support and review comprehensiveness, validating citation context as a robust and effective data paradigm for reliable automated peer review.

190. 【2602.00699】From Prompt to Graph: Comparing LLM-Based Information Extraction Strategies in Domain-Specific Ontology Development

链接https://arxiv.org/abs/2602.00699

作者:Xuan Liu,Ziyu Li,Mu He,Ziyang Ma,Xiaoxu Wu,Gizem Yilmaz,Yiyuan Xia,Bingbing Li,He Tan,Jerry Ying Hsi Fuh,Wen Feng Lu,Anders E.W. Jarfors,Per Jansson

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Ontologies are essential, structuring domain knowledge, improving accessibility, Large Language Models, essential for structuring

备注: 11 pages,8 figures,3 tables,presented at International Conference on Industry of the Future and Smart Manufacturing,2025

点击查看摘要

Abstract:Ontologies are essential for structuring domain knowledge, improving accessibility, sharing, and reuse. However, traditional ontology construction relies on manual annotation and conventional natural language processing (NLP) techniques, making the process labour-intensive and costly, especially in specialised fields like casting manufacturing. The rise of Large Language Models (LLMs) offers new possibilities for automating knowledge extraction. This study investigates three LLM-based approaches, including pre-trained LLM-driven method, in-context learning (ICL) method and fine-tuning method to extract terms and relations from domain-specific texts using limited data. We compare their performances and use the best-performing method to build a casting ontology that validated by domian expert.

191. 【2602.00665】Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation

链接https://arxiv.org/abs/2602.00665

作者:Lakshan Cooray,Deshan Sumanathilaka,Pattigadapa Venkatesh Raju

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Customer-service question answering, Large Language Models, systems increasingly rely, question answering, Language Models

备注: Submission is under review with Computational Linguistics

点击查看摘要

Abstract:Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.

192. 【2602.00642】LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

链接https://arxiv.org/abs/2602.00642

作者:Haitao Li,Yifan Chen,Shuo Miao,Qian Dong,Jia Chen,Yiran Hu,Junjie Chen,Minghao Qin,Qingyao Ai,Yiqun Liu,Cheng Luo,Quan Zhou,Ya Zhang,Jikun Hu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, demonstrated impressive general, performing rigorous multi-step, Language Models

备注: 25 pages, v1

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.

193. 【2602.00638】Formal Semantic Control over Language Models

链接https://arxiv.org/abs/2602.00638

作者:Yingji Zhang

类目:Computation and Language (cs.CL)

关键词:latent space geometry, latent space, advances semantic representation, thesis advances semantic, semantic representation learning

备注

点击查看摘要

Abstract:This thesis advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control: disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control: isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. The overarching objective is to move toward language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed. We introduce a set of novel theoretical frameworks and practical methodologies, together with corresponding experiments, to demonstrate that our approaches enhance both the interpretability and controllability of latent spaces for natural language across the thesis.

194. 【2602.00628】From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs

链接https://arxiv.org/abs/2602.00628

作者:Louis Schiekiera,Max Zimmer,Christophe Roux,Sebastian Pokutta,Fritz Günther

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:LLM hidden-state geometry, psycholinguistic experiments, LLM hidden-state, investigate the extent, LLM

备注: 25 pages including references, 15 figures, 6 tables

点击查看摘要

Abstract:We investigate the extent to which an LLM's hidden-state geometry can be recovered from its behavior in psycholinguistic experiments. Across eight instruction-tuned transformer models, we run two experimental paradigms -- similarity-based forced choice and free association -- over a shared 5,000-word vocabulary, collecting 17.5M+ trials to build behavior-based similarity matrices. Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus. We find that forced-choice behavior aligns substantially more with hidden-state geometry than free association. In a held-out-words regression, behavioral similarity (especially forced choice) predicts unseen hidden-state similarities beyond lexical baselines and cross-model consensus, indicating that behavior-only measurements retain recoverable information about internal semantic geometry. Finally, we discuss implications for the ability of behavioral tasks to uncover hidden cognitive states.

195. 【2602.00619】Jailbreaking LLMs via Calibration

链接https://arxiv.org/abs/2602.00619

作者:Yuxuan Lu,Yongkang Guo,Yuqing Kong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:Large Language Models, underlying pre-aligned data, Large Language, pre-aligned data distribution, model aligned output

备注

点击查看摘要

Abstract:Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower "Jailbreak Tax" compared with existing methods, especially on the safety-hardened gpt-oss-120b.

196. 【2602.00613】ransformer-Based Model for Multilingual Hope Speech Detection

链接https://arxiv.org/abs/2602.00613

作者:Nsrin Ashraf,Mariam Labib,Hamada Nayel

类目:Computation and Language (cs.CL)

关键词:paper describes, English, English and Germany, implemented, English and German

备注: 5 pages, 1 figure, PolyHope-M shared task at RANLP2025

点击查看摘要

Abstract:This paper describes a system that has been submitted to the "PolyHope-M" at RANLP2025. In this work various transformers have been implemented and evaluated for hope speech detection for English and Germany. RoBERTa has been implemented for English, while the multilingual model XLM-RoBERTa has been implemented for both English and German languages. The proposed system using RoBERTa reported a weighted f1-score of 0.818 and an accuracy of 81.8% for English. On the other hand, XLM-RoBERTa achieved a weighted f1-score of 0.786 and an accuracy of 78.5%. These results reflects the importance of improvement of pre-trained large language models and how these models enhancing the performance of different natural language processing tasks.

197. 【2602.00612】Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars

链接https://arxiv.org/abs/2602.00612

作者:Yitong Zhang,Yongmin Li,Yuetong Liu,Jia Li,Xiaoran Jia,Zherui Li,Ge Li

类目:Computation and Language (cs.CL)

关键词:Diffusion Large Language, Diffusion Large, Large Language Models, demonstrated promising generative, promising generative capabilities

备注

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) have demonstrated promising generative capabilities and are increasingly used to produce formal languages defined by context-free grammars, such as source code and chemical expressions. However, as probabilistic models, they still struggle to generate syntactically valid outputs reliably. A natural and promising direction to address this issue is to adapt constrained decoding techniques to enforce grammatical correctness during generation. However, applying these techniques faces two primary obstacles. On the one hand, the non-autoregressive nature of dLLMs renders most existing constrained decoding approaches inapplicable. On the other hand, current approaches specifically designed for dLLMs may allow intermediate outputs that are impossible to complete into valid sentences, which significantly limits their reliability in practice. To address these challenges, we present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Whenever a new token is proposed by model, LAVE performs lookahead using these distributions to efficiently and reliably verify the validity of the proposed token. This design ensures reliable constraints by reliably preserving the potential for intermediate outputs to be extended into valid sentences. Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.00612 [cs.CL]

(or
arXiv:2602.00612v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.00612

Focus to learn more

              arXiv-issued DOI via DataCite</p>
198. 【2602.00597】Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling

链接https://arxiv.org/abs/2602.00597

作者:Chaoqun Cui,Shijing Wang,Liangbin Huang,Qingqing Gu,Zhaolong Huang,Xiao Zeng,Wenji Mao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Interlingual subtitling, visual media, essential for entertainment, entertainment localization

备注: Accepted to The Web Conference (WWW) 2026

点击查看摘要

Abstract:Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.

199. 【2602.00594】Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

链接https://arxiv.org/abs/2602.00594

作者:Zhijie Huang,Stephen McIntosh,Daisuke Saito,Nobuaki Minematsu

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:good language model, language model starts, good language, language model, model starts

备注

点击查看摘要

Abstract:A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.

200. 【2602.00588】he French Drama Revolution: Political Economy and Literary Production, 1700-1900

链接https://arxiv.org/abs/2602.00588

作者:Thiago Dumont Oliveira

类目:Computation and Language (cs.CL)

关键词:Latent Dirichlet Allocation, Latent Dirichlet, Dirichlet Allocation, Jensen-Shannon Divergence, Allocation and Jensen-Shannon

备注

点击查看摘要

Abstract:This paper investigates the changing nature of French drama between 1700-1900 using Latent Dirichlet Allocation and Jensen-Shannon Divergence. Results indicate that the topical distribution of French drama changed profoundly after the French Revolution, particularly between 1789 and 1850. Bourgeois themes emerged among the most prevalent topics since the late 18th century. To assess the coevolution of drama and economic growth, I plot the yearly prevalence of topics alongside French GDP between 1700-1900, and discuss these changes in light of the political and economic changes prompted by the French Revolution and the industrialization of the country.

201. 【2602.00564】Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

链接https://arxiv.org/abs/2602.00564

作者:Xiang Zheng,Weiqi Zhai,Wei Wang,Boyu Yang,Wenbo Li,Ruixiang Luo,Haoxiang Sun,Yucheng Wang,Zhengze Li,Meng Wang,Yuetian Du,Guojie Lin,Yaxuan Wang,Xiaoxiao Xu,Yanhu Mo,Xuan Ren,Hu Wei,Ze Xu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Recent large language, Recent large, achieve near-saturation accuracy, genuine reasoning competence, large language models

备注: 8 pages, and 3 figures

点击查看摘要

Abstract:Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.

202. 【2602.00554】A Hierarchical and Attentional Analysis of Argument Structure Constructions in BERT Using Naturalistic Corpora

链接https://arxiv.org/abs/2602.00554

作者:Liu Kaipeng,Wu Ling

类目:Computation and Language (cs.CL)

关键词:Bidirectional Encoder Representations, Argument Structure Constructions, Transformers model processes, fundamental Argument Structure, Bidirectional Encoder

备注: 14 pages, 5 figures

点击查看摘要

Abstract:This study investigates how the Bidirectional Encoder Representations from Transformers model processes four fundamental Argument Structure Constructions. We employ a multi-dimensional analytical framework, which integrates MDS, t-SNE as dimensionality reduction, Generalized Discrimination Value (GDV) as cluster separation metrics, Fisher Discriminant Ratio (FDR) as linear diagnostic probing, and attention mechanism analysis. Our results reveal a hierarchical representational structure. Construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.

203. 【2602.00543】Reasoning by Commented Code for Table Question Answering

链接https://arxiv.org/abs/2602.00543

作者:Seho Pyo,Jiheon Seok,Jaejin Lee

类目:Computation and Language (cs.CL)

关键词:Table Question Answering, Question Answering, two-dimensional relationships intrinsic, Table Question, poses a significant

备注

点击查看摘要

Abstract:Table Question Answering (TableQA) poses a significant challenge for large language models (LLMs) because conventional linearization of tables often disrupts the two-dimensional relationships intrinsic to structured data. Existing methods, which depend on end-to-end answer generation or single-line program queries, typically exhibit limited numerical accuracy and reduced interpretability. This work introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into the Python program-generation process. The approach decomposes TableQA reasoning into multi-line executable programs with concise natural language comments, thereby promoting clearer reasoning and increasing the likelihood of generating correct code. On the WikiTableQuestions benchmark, the proposed method achieves 70.9\% accuracy using Qwen2.5-Coder-7B-Instruct, surpassing the Repanda baseline (67.6\%). Integrating the proposed framework with a robust end-to-end TableQA model via a lightweight answer-selection mechanism yields further improvements. This combined approach achieves up to 84.3\% accuracy on the WikiTableQuestions benchmark.

204. 【2602.00497】Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design

链接https://arxiv.org/abs/2602.00497

作者:Hanjing Shi,Dominic DiFranzo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:largely assume English-centric, assume English-centric data, homogeneous user populations, frameworks largely assume, assume English-centric

备注

点击查看摘要

Abstract:Multilingual large language models (MLLMs) are increasingly deployed across cultural, linguistic, and political contexts, yet existing governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness. This creates systematic risks for low-resource languages and culturally marginalized communities, where data practices, model behavior, and accountability mechanisms often fail to align with local norms, rights, and expectations. Drawing on cross-cultural perspectives in human-centered computing and AI governance, this paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and articulates a culturally grounded governance framework for MLLMs. We identify three interrelated governance challenges: cultural and linguistic inequities in training data and evaluation practices, misalignment between global deployment and locally situated norms, values, and power structures, and limited accountability mechanisms for addressing harms experienced by marginalized language communities. Rather than proposing new technical benchmarks, we contribute a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights based problem. We outline design and policy implications for data stewardship, transparency, and participatory accountability, and argue that culturally grounded governance is essential for ensuring that multilingual language models do not reproduce existing global inequalities under the guise of scale and neutrality.

205. 【2602.00491】From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas

链接https://arxiv.org/abs/2602.00491

作者:Zhaokun Yan,Zhaohan Liu,Wuzheng Dong,Lijie Feng,Chengxiao Dai

类目:Computation and Language (cs.CL)

关键词:requires population level, population level inference, level inference grounded, reasoning requires population, Public health reasoning

备注

点击查看摘要

Abstract:Public health reasoning requires population level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce \textbf{GlobalHealthAtlas}, a large scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice based evaluation. We further propose large language model (LLM) assisted construction and quality control pipeline with retrieval, duplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety critical public health reasoning beyond conventional QA benchmarks.

206. 【2602.00477】Intention-Adaptive LLM Fine-Tuning for Text Revision Generation

链接https://arxiv.org/abs/2602.00477

作者:Zhexiong Liu,Diane Litman

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, tasks remain underexplored, achieved impressive capabilities, generation tasks remain

备注: In the Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer's actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, they struggle with handling entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.

207. 【2602.00476】Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly

链接https://arxiv.org/abs/2602.00476

作者:Hengchang Liu,Zhao Yang,Bing Su

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Diffusion language models, bidirectional generation framework, generation framework naturally, framework naturally suited, Diffusion language

备注

点击查看摘要

Abstract:Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre-specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first-step denoising confidence: a local \textit{Oracle Peak} that emerges near the ground-truth length and a systematic \textit{Length Bias} that often obscures this signal. By leveraging this signal and calibrating the bias, our training-free method \textbf{CAL} (\textbf{C}alibrated \textbf{A}daptive \textbf{L}ength) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7\% over fixed-length baselines and 40.5\% over chat-based adaptive methods in code infilling, while boosting BLEU-2 and ROUGE-L by up to 8.5\% and 9.9\% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at this https URL.

208. 【2602.00469】Words that make SENSE: Sensorimotor Norms in Learned Lexical Token Representations

链接https://arxiv.org/abs/2602.00469

作者:Abhinav Gupta,Toben H. Mintz,Jesse Thomason

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:embeddings derive meaning, human language understanding, word embeddings derive, textbf, motor experience

备注: 5 pages, 2 figures, codebase can be found at: [this https URL](https://github.com/abhinav-usc/SENSE-model/tree/main)

点击查看摘要

Abstract:While word embeddings derive meaning from co-occurrence patterns, human language understanding is grounded in sensory and motor experience. We present $\text{SENSE}$ $(\textbf{S}\text{ensorimotor }$ $\textbf{E}\text{mbedding }$ $\textbf{N}\text{orm }$ $\textbf{S}\text{coring }$ $\textbf{E}\text{ngine})$, a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. We also conducted a behavioral study where 281 participants selected which among candidate nonce words evoked specific sensorimotor associations, finding statistically significant correlations between human selection rates and $\text{SENSE}$ ratings across 6 of the 11 modalities. Sublexical analysis of these nonce words selection rates revealed systematic phonosthemic patterns for the interoceptive norm, suggesting a path towards computationally proposing candidate phonosthemes from text data.

209. 【2602.00459】What Matters to an LLM? Behavioral and Computational Evidences from Summarization

链接https://arxiv.org/abs/2602.00459

作者:Yongxin Zhou,Changshun Wu,Philippe Mulhem,Didier Schwab,Maxime Peyrard

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, selections remains hidden, Language Models, remains hidden

备注: Findings of EACL 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are now state-of-the-art at summarization, yet the internal notion of importance that drives their information selections remains hidden. We propose to investigate this by combining behavioral and computational analyses. Behaviorally, we generate a series of length-controlled summaries for each document and derive empirical importance distributions based on how often each information unit is selected. These reveal that LLMs converge on consistent importance patterns, sharply different from pre-LLM baselines, and that LLMs cluster more by family than by size. Computationally, we identify that certain attention heads align well with empirical importance distributions, and that middle-to-late layers are strongly predictive of importance. Together, these results provide initial insights into what LLMs prioritize in summarization and how this priority is internally represented, opening a path toward interpreting and ultimately controlling information selection in these models.

210. 【2602.00428】When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

链接https://arxiv.org/abs/2602.00428

作者:Naen Xu,Hengyu An,Shuo Shi,Jinghuai Zhang,Chunyi Zhou,Changjiang Li,Tianyu Du,Zhihui Fu,Jun Wang,Shouling Ji

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词:large language models, address complex challenges, Recent advancements, Mandela effect, multi-agent systems

备注: ICLR 2026

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.

211. 【2602.00426】LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

链接https://arxiv.org/abs/2602.00426

作者:Vikram Krishnamurthy

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)

关键词:underlying computational structure, Large language models, Large language, based on transformer, obscuring their underlying

备注: 27 pages, 12 figures. Mathematical survey framing LLMs as high-dimensional nonlinear autoregressive models with attention, covering training, alignment, and inference, with nanoGPT/nanochat-style code examples. Feedback welcome

点击查看摘要

Abstract:Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear--softmax--linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.

212. 【2602.00425】Segment-Level Attribution for Selective Learning of Long Reasoning Traces

链接https://arxiv.org/abs/2602.00425

作者:Siyuan Wang,Yanchen Liu,Xiang Ren

类目:Computation and Language (cs.CL)

关键词:Large Reasoning Models, achieve strong reasoning, generating long chains, traces meaningfully contributes, Large Reasoning

备注: Accepted to ICLR 2026. 16 pages, 5 figures. Code available at [this https URL](https://github.com/SiyuanWangw/SegmentSelectiveSFT)

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment-level metrics: (1) \textit{attribution strength} measures the overall attribution magnitude; and (2) \textit{direction consistency} captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces~\footnote{Code and data are available at this https URL}.

213. 【2602.00393】Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset

链接https://arxiv.org/abs/2602.00393

作者:Gabriel Bromonschenkel,Alessandro L. Koerich,Thiago M. Paixão,Hilário Tomaz Alves de Oliveira

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:media content generation, social media content, Brazilian Portuguese, natural language descriptions, Image captioning

备注: Accepted to JBCS. 18 pages, 11 figures

点击查看摘要

Abstract:Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: this https URL.

214. 【2602.00380】Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models

链接https://arxiv.org/abs/2602.00380

作者:Sercan Karakaş

类目:Computation and Language (cs.CL)

关键词:Turkish reflexive pronouns, large language models, language models capture, large language, study evaluates

备注: 15 pages

点击查看摘要

Abstract:This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi, and test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined sentence-level perplexity and forced-choice paradigm. Trendyol-LLM favours local bindings in approximately 70% of trials, exhibiting a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked contrast in binding behaviour across the two systems.

215. 【2602.00377】DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models

链接https://arxiv.org/abs/2602.00377

作者:Zhaochen Hong,Jiaxuan You

类目:Computation and Language (cs.CL)

关键词:Existing knowledge probing, Existing knowledge, rely on pre-defined, probing methods rely, limiting extraction

备注

点击查看摘要

Abstract:Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. We introduce DecompressionLM, a stateless framework for zero-shot concept graph extraction that discovers what language models encode without pre-specified queries or shared cross-sequence state. Our method targets three limitations of common decoding-based probing approaches: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability constraints arising from sequential exploration. Using Van der Corput low-discrepancy sequences with arithmetic decoding, DecompressionLM enables deterministic, embarrassingly parallel generation without shared state across sequences. Across two model families and five quantization variants, we find that activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse -- divergent behaviors not reliably reflected by explanation-level perplexity. Corpus-based verification further reveals a 17-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models. DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models useful for their deployment.

216. 【2602.00372】Post-Training Probability Manifold Correction via Structured SVD Pruning and Self-Referential Distillation

链接https://arxiv.org/abs/2602.00372

作者:Aaron R. Flouro,Shawn P. Chadwick

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, Sparse Knowledge Distillation, expensive to deploy, introduce Sparse Knowledge

备注: 16 pages, 10 tables, 4 figures

点击查看摘要

Abstract:Large language models are expensive to deploy. We introduce Sparse Knowledge Distillation (SparseKD), a post-training method that compresses transformer models by combining structured SVD pruning with self-referential knowledge distillation. The key insight is simple: instead of using an external teacher, the model teaches itself by matching its own probability distribution from before compression. This self-referential setup enables surprisingly strong quality recovery after aggressive pruning. Our experiments reveal an unexpected finding: self-referential distillation alone, applied post-training under an identical objective and fixed calibration dataset, improves model quality by 39% relative to the original converged checkpoint. When combined with structured pruning, SparseKD achieves 15-65% parameter reduction with acceptable quality trade-offs. Kernel profiling shows that speedups arise entirely from reduced dense matrix multiplication in feed-forward layers while attention remains unchanged, making this approach complementary to attention optimizations. We validate across two model families (0.6B and 3.8B parameters) with multi-seed experiments confirming high reproducibility. SparseKD requires no external super-teacher, no architectural changes, and no custom inference kernels, making it immediately deployable with existing infrastructure.

Comments:
16 pages, 10 tables, 4 figures

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

MSC classes:
68T05

ACMclasses:
I.2.6

Cite as:
arXiv:2602.00372 [cs.LG]

(or
arXiv:2602.00372v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2602.00372

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
217. 【2602.00352】DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

链接https://arxiv.org/abs/2602.00352

作者:Li Siyan,Darshan Deshpande,Anand Kannappan,Rebecca Qian

类目:Computation and Language (cs.CL)

关键词:Obscure Under-specified Retrieval, information in conversation, people often arrive, multiple turns, recalling information

备注

点击查看摘要

Abstract:When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.

218. 【2602.00344】When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

链接https://arxiv.org/abs/2602.00344

作者:Beidi Zhao,Wenlong Deng,Xinting Liao,Yushu Li,Nazim Shaikh,Yao Nie,Xiaoxiao Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:knowledge-based VQA tasks, enhancing Large Vision-Language, Large Vision-Language Models, recent work attributes, VQA tasks

备注: 18 pages, 10 figures

点击查看摘要

Abstract:While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous study overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

219. 【2602.00319】Detecting AI-Generated Content in Academic Peer Reviews

链接https://arxiv.org/abs/2602.00319

作者:Siyuan Shen,Kai Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

关键词:large language models, Nature Communications reviews, academic peer review, Nature Communications, growing availability

备注

点击查看摘要

Abstract:The growing availability of large language models (LLMs) has raised questions about their role in academic peer review. This study examines the temporal emergence of AI-generated content in peer reviews by applying a detection model trained on historical reviews to later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC). We observe minimal detection of AI-generated content before 2022, followed by a substantial increase through 2025, with approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. The most pronounced growth of AI-generated reviews in NC occurs between the third and fourth quarter of 2024. Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.

220. 【2602.00316】MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

链接https://arxiv.org/abs/2602.00316

作者:Rodrigo Batista,Luís Filipe Cunha,Purificação Silvano,Nuno Guimarães,Alípio Jorge,Evelin Amorim,Ricardo Campos

类目:Computation and Language (cs.CL)

关键词:exhibiting heterogeneous formats, local governance, exhibiting heterogeneous, writing styles, Municipal meeting minutes

备注

点击查看摘要

Abstract:Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through deslexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.

221. 【2602.00300】Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

链接https://arxiv.org/abs/2602.00300

作者:Xilin Gong,Shu Yang,Zehua Cao,Lynne Billard,Di Wang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, demonstrated strong capabilities, Language Models, hidden representation interpretation

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute "purple" for "broccoli", LLMs still generate "green" in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84\% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33\% relative performance improvement.

222. 【2602.00279】Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering

链接https://arxiv.org/abs/2602.00279

作者:Philip Müller,Nicholas Popovič,Michael Färber,Peter Steinbach

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, natural sciences, Question Answering, Large Language, question answering setting

备注: Under Review

点击查看摘要

Abstract:Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA studying calibration of UQ methods, providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent approaches on a total of 685,000 long-form responses, spanning different reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are exposed to the same effect, but the reasoning process appears to mitigate it depending on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. In the wake of our analysis, we study and report the misleading effect of relying exclusively on ECE as a sole measure for judging performance of UQ methods on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and standard practices in benchmarking thereof.

223. 【2602.00250】ABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models

链接https://arxiv.org/abs/2602.00250

作者:Shreshth Saini,Avinab Saha,Balu Adsumilli,Neil Birkbeck,Yilin Wang,Alan C. Bovik

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Masked Diffusion Models, offering parallel decoding, bidirectional context utilization, Masked Diffusion, Diffusion Models

备注

点击查看摘要

Abstract:Masked Diffusion Models (MDMs) have emerged as a promising non-autoregressive paradigm for generative tasks, offering parallel decoding and bidirectional context utilization. However, current sampling methods rely on simple confidence-based heuristics that ignore the long-term impact of local decisions, leading to trajectory lock-in where early hallucinations cascade into global incoherence. While search-based methods mitigate this, they incur prohibitive computational costs ($O(K)$ forward passes per step). In this work, we propose Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via a single backward pass. We formally derive the Token Influence Score (TIS) from a first-order expansion of the trajectory cost functional, proving that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty. To ensure scalability, we introduce \texttt{ActiveQueryAttention}, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity. BoE achieves a superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating that gradient-guided steering offers a mathematically principled and efficient path to robust non-autoregressive generation. We will release the code.

224. 【2602.00238】DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

链接https://arxiv.org/abs/2602.00238

作者:Tianyi Hu,Niket Tandon,Akhil Arora

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Existing retrieval-augmented generation, Existing retrieval-augmented, single correct answer, primarily designed, single correct

备注

点击查看摘要

Abstract:Existing retrieval-augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: this https URL

225. 【2602.00161】Block removal for large language models through constrained binary optimization

链接https://arxiv.org/abs/2602.00161

作者:David Jansen,Roman Rausch,David Montero,Roman Orus

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantum Physics (quant-ph)

关键词:Compressing resource-intensive large, seemingly simple idea, exponentially difficult combinatorial, difficult combinatorial problem, Compressing resource-intensive

备注: 7 pages, 5 figures

点击查看摘要

Abstract:Compressing resource-intensive large language models by removing whole transformer blocks is a seemingly simple idea, but identifying which blocks to remove constitutes an exponentially difficult combinatorial problem. In this paper, we formulate block removal as a constrained binary optimization problem that can be mapped to a physical system (Ising model), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations and yields many high-quality, non-trivial solutions beyond consecutive regions. We demonstrate that our approach outperforms state-of-the-art block-removal methods across several benchmarks, with performance gains persisting after short retraining, and reaching improvements of up to 6 points on the MMLU benchmark. Our method requires only forward and backward passes for a few active parameters, together with an (at least approximate) Ising solver, and can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure.

226. 【2602.00150】Reversible Diffusion Decoding for Diffusion Language Models

链接https://arxiv.org/abs/2602.00150

作者:Xinyun Wang,Min Zhang,Sen Cui,Zhikang Chen,Bo Jiang,Kun Kuang,Mingbao Lin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:propose Reversible Diffusion, Diffusion language models, diffusion process fails, reverse diffusion process, URL propose Reversible

备注

点击查看摘要

Abstract:Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation, where the reverse diffusion process fails to make further progress under a suboptimal this http URL propose Reversible Diffusion Decoding (RDD), a decoding framework that introduces reversibility into block-wise diffusion generation. RDD detects stagnation as a state-dependent failure of the reverse process and enables efficient backtracking to earlier blocks without recomputation via cached model states. To avoid repeated failure trajectories, RDD applies confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable this http URL reversible formulation allows decoding to recover from early commitment errors while maintaining the parallel efficiency of diffusion-based generation. Experiments show that RDD improves generation robustness and quality over baselines with minimal computational overhead.

227. 【2602.00092】Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

链接https://arxiv.org/abs/2602.00092

作者:Neha Kalibhat,Zi Wang,Prasoon Bajpai,Drew Proud,Wenjun Zeng,Been Kim,Mani Malek

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:natural language summary, black-box interpretability framework, introduce a black-box, black-box interpretability, natural language

备注

点击查看摘要

Abstract:We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model's specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of 1.86 times boost in success rate over methods that do not use constitutions.

228. 【2602.00085】CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models

链接https://arxiv.org/abs/2602.00085

作者:Shuozhe Li,Jincheng Cao,Bodun Hu,Aryan Mokhtari,Leqi Liu,Amy Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Regularized Reinforcement Finetuning, Reinforcement finetuning, unlocking reasoning capabilities, large language models, RFT

备注

点击查看摘要

Abstract:Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, RKL-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.

229. 【2602.00083】SPARC-RAG: Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation

链接https://arxiv.org/abs/2602.00083

作者:Yuxin Yang,Gangda Deng,Ömer Faruk Akgül,Nima Chitsazan,Yash Govilkar,Akasha Tigalappanavara,Shi-Xiong Zhang,Sambit Sahu,Viktor Prasanna

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:grounds large language, requires long reasoning, large language model, language model outputs, Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) grounds large language model outputs in external evidence, but remains challenged on multi-hop question answering that requires long reasoning. Recent works scale RAG at inference time along two complementary dimensions: sequential depth for iterative refinement and parallel width for coverage expansion. However, naive scaling causes context contamination and scaling inefficiency, leading to diminishing or negative returns despite increased computation. To address these limitations, we propose SPARC-RAG, a multi-agent framework that coordinates sequential and parallel inference-time scaling under a unified context management mechanism. SPARC-RAG employs specialized agents that maintain a shared global context and provide explicit control over the scaling process. It generates targeted, complementary sub-queries for each branch to enable diverse parallel exploration, and explicitly regulates exiting decisions based on answer correctness and evidence grounding. To optimize scaling behavior, we further introduce a lightweight fine-tuning method with process-level verifiable preferences, which improves the efficiency of sequential scaling and effectiveness of parallel scaling. Across single- and multi-hop QA benchmarks, SPARC-RAG consistently outperforms previous RAG baselines, yielding an average +6.2 F1 improvement under lower inference cost.

230. 【2602.00070】FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

链接https://arxiv.org/abs/2602.00070

作者:Eamon Worden,Cristina Heffernan,Neil Heffernan,Shashank Sonkar

类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, LLMs, Large, Knowledge Tracing

备注

点击查看摘要

Abstract:Can Large Language Models understand how students learn? As LLMs are deployed for adaptive testing and personalized tutoring, this question becomes urgent -- yet we cannot answer it with existing resources. Current educational datasets provide only question identifiers and binary correctness labels, rendering them opaque to LLMs that reason in natural language. We address this gap with FoundationalASSIST, the first English educational dataset providing the complete information needed for research on LLMs in education: full question text, actual student responses (not just right/wrong), records of which wrong answers students chose, and alignment to Common Core K-12 standards. These 1.7 million interactions from 5,000 students enable research directions that were previously impossible to pursue, from fine-tuning student models to analyzing misconception patterns. To demonstrate the dataset's utility, we evaluate four frontier models (GPT-OSS-120B, Llama-3.3-70B, Qwen3-Next-80B variants) on two complementary task families: Knowledge Tracing, testing whether LLMs can predict student performance on questions, and the exact answer a student will give; and \textbf{Pedagogical Grounding}, testing whether LLMs understand the properties that make assessment items effective. Our evaluation reveals significant gaps in current LLM capabilities. Every model barely achieves a trivial baseline on knowledge tracing. All models fall below random chance on item discrimination, indicating that LLMs do not understand what makes one problem more diagnostic than another. Models do show competence at judging relative difficulty (up to 68.6%), but this partial success only highlights the gaps elsewhere. These results establish that substantial advances are needed before LLMs can reliably support personalized learning at scale. We release FoundationalASSIST to support progress on these foundational challenges.

231. 【2602.00066】IntentCoding: Amplifying User Intent in Code Generation

链接https://arxiv.org/abs/2602.00066

作者:Zheng Fang,Yihong Dong,Lili Mou,Dongming Jin,Zhi Jin,Ge Li

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, user intent, shown strong capabilities, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have shown strong capabilities in code generation, but their adherence to fine-grained user intent with multiple constraints remains a significant challenge. Our empirical analysis reveals two key observations: 1) Model performance deteriorates quickly as the number of constraints in the user intent increases, and 2) While user intent does influence the model's logits, such an influence may not be strong enough to effectively steer the decoding process. To this end, we propose Intent-Amplified Code Generation (IntentCoding), a novel decoding strategy that enhances an LLM's ability to follow user intent. IntentCoding captures the influence of user intent by masking out the intent, and applies a multi-strength ensemble mechanism to amplify the effect of user intent during generation. IntentCoding is model-agnostic, requires no additional training, and integrates seamlessly with existing decoding procedures. To enable systematic evaluation, we also construct CodeConstraints, a benchmark dataset specifically designed to test user intent compliance under varying numbers of constraints. Experiments on our constructed Constraints, as well as popular IFEvalCode, HumanEval and LiveCodeBench datasets, show that our IntentCoding model significantly improves both constraint satisfaction and functional correctness compared to standard decoding approaches. IntentCoding achieves up to 71.0% relative improvement on CodeConstraints, achieves up to 67.3% relative improvement on IFEvalCode and achieves up to 29.3% relative improvement in pass@1 on HumanEval and LiveCodeBench compared with greedy decoding.

232. 【2602.00052】AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

链接https://arxiv.org/abs/2602.00052

作者:Ramtin Babaeipour,François Charest,Madison Wright

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:knowledge management create, management create significant, create significant burden, Increasing clinical trial, Increasing clinical

备注

点击查看摘要

Abstract:Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI-assistance on simulated extraction CRC workflows. Our RAG process was measured as more accurate (87.8%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In the simulated extraction workflows, AI-assisted tasks were completed 40% faster, rated as less cognitively demanding and strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real world clinical workflows to further validate its impact on feasibility, study start-up, and post-activation monitoring.

233. 【2602.00046】Extending Beacon to Hindi: Cultural Adaptation Drives Cross-Lingual Sycophancy

链接https://arxiv.org/abs/2602.00046

作者:Sarthak Sattigeri

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:failure in English-language, principled reasoning, persistent alignment failure, English-language evaluations, prioritize agreement

备注: First Hindi sycophancy benchmark using a three-condition design separating language and cultural effects, with empirical evaluation across four instruction-tuned models

点击查看摘要

Abstract:Sycophancy, the tendency of language models to prioritize agreement with user preferences over principled reasoning, has been identified as a persistent alignment failure in English-language evaluations. However, it remains unclear whether such diagnostics generalize across languages and cultural contexts. We extend the Beacon single-turn forced-choice sycophancy diagnostic to Hindi through a controlled three-condition design: English original, Hindi literal translation, and Hindi culturally adapted prompts. We evaluate four open-weight instruction-tuned models on 50 prompts per condition, enabling separation of language encoding effects from cultural adaptation effects. Across all models, sycophancy rates are consistently higher for culturally adapted Hindi prompts than for English, with absolute differences ranging from 12.0 to 16.0 percentage points. A decomposition on Qwen 2.5-Coder-7B shows that cultural adaptation (delta = 14.0%, 95% CI: [4.0%, 26.0%]) accounts for the majority of this gap, while language encoding contributes minimally (delta = 2.0%, 95% CI: [0.0%, 6.0%]). Category-level analysis reveals that advice prompts exhibit the largest cross-lingual differences (20-25 percentage points), achieving statistical significance in two of four models. These findings indicate that alignment behaviors measured in English may not transfer uniformly across languages and that culturally grounded prompt framing plays a substantial role. We release all datasets and evaluation code to support replication and extension.

234. 【2602.00029】Construct, Align, and Reason: Large Ontology Models for Enterprise Knowledge Management

链接https://arxiv.org/abs/2602.00029

作者:Yao Zhang,Hongyin Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Enterprise-scale knowledge management, management faces significant, faces significant challenges, integrating multi-source heterogeneous, multi-source heterogeneous data

备注

点击查看摘要

Abstract:Enterprise-scale knowledge management faces significant challenges in integrating multi-source heterogeneous data and enabling effective semantic reasoning. Traditional knowledge graphs often struggle with implicit relationship discovery and lack sufficient semantic understanding for complex question answering. To address these limitations, we introduce a unified construct--align--reason framework, the large ontology model (LOM). We first build a dual-layer enterprise ontology from structured databases and unstructured text, subsequently fusing these sources into a comprehensive enterprise ontology. To enable instruction-aligned reasoning, we propose a unified three-stage training pipeline: ontology instruction fine-tuning to improve structural understanding; text-ontology grounding to strengthen node semantic encoding; and multi-task instruction tuning on ontology-language pairs with curriculum learning to enhance semantic reasoning and generation. We also construct comprehensive training and evaluation datasets covering diverse ontology reasoning tasks. On this benchmark, our 4B-parameter LOM achieves 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language.

235. 【2602.00017】SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

链接https://arxiv.org/abs/2602.00017

作者:Benyamin Tabarsi,Wenbo Li,Tahreem Yasir,Aryan Santhosh Kumar,Laura Widman,Dongkuan Xu,Tiffany Barnes

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

关键词:challenging to collect, sensitive nature, importance of effective, real-world data, scarce and challenging

备注

点击查看摘要

Abstract:The importance of effective parent-child communication about sexual health is widely acknowledged, but real-world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.

236. 【2602.00016】PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems

链接https://arxiv.org/abs/2602.00016

作者:Jiongchi Yu,Yuhan Ma,Xiaoyu Zhang,Junjie Wang,Qiang Hu,Chao Shen,Xiaofei Xie

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, maintaining a consistent, trust and engagement, authentic LLM personality, increasing deployment

备注: 28 pages

点击查看摘要

Abstract:With the increasing deployment of large language models (LLMs) in affective agents and AI systems, maintaining a consistent and authentic LLM personality becomes critical for user trust and engagement. However, existing work overlooks a fundamental psychological consensus that personality traits are dynamic and context-dependent. To bridge this gap, we introduce PTCBENCH, a systematic benchmark designed to quantify the consistency of LLM personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses the personality using the NEO Five-Factor Inventory. Our study on 39,240 personality trait records reveals that certain external scenarios (e.g., "Unemployment") can trigger significant personality changes of LLMs, and even alter their reasoning capabilities. Overall, PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering actionable insights for developing robust and psychologically aligned AI systems.

237. 【2602.00015】G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models

链接https://arxiv.org/abs/2602.00015

作者:Xun Xu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, natural language understanding, Large Language, demonstrated remarkable capabilities, maintaining long-term factual

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, yet they remain constrained by the finite capacity of their context windows and the inherent difficulty of maintaining long-term factual consistency during multi-hop reasoning. While existing methods utilize context compression or recurrent tokens, they often suffer from ``context rot'' or the dilution of information over long horizons. In this paper, we propose \textbf{G-MemLLM}, a memory-augmented architecture that integrates a frozen LLM backbone with a trainable \textbf{Latent Memory Bank}. Our key innovation is a GRU-style gated update logic that allows the model to selectively update, preserve, or overwrite latent memory slots, preventing the vanishing gradients of knowledge common in recurrent systems. We evaluate G-MemLLM across scales, from GPT-2 (124M) to Llama 3.1 (8B), on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks. Our results demonstrate that G-MemLLM significantly enhances multi-hop reasoning and relational precision, achieving a 13.3\% accuracy boost on ZsRE for Llama 3.1-8B, and it also yields improvements across model scales, boosting Answer F1 by 8.56 points for GPT-2 and increasing Supporting Fact F1 by 6.89 points for Llama 3.1-8B on HotpotQA.

238. 【2602.00011】Chained Prompting for Better Systematic Review Search Strategies

链接https://arxiv.org/abs/2602.00011

作者:Fatima Nasser,Fouad Trad,Ammar Mohanna,Ghada El-Hajj Fuleihan,Ali Chehab

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:rigorously designed search, Systematic reviews require, Large Language Model, designed search strategies, minimization of bias

备注: Accepted in the 3rd International Conference on Foundation and Large Language Models (FLLM2025)

点击查看摘要

Abstract:Systematic reviews require the use of rigorously designed search strategies to ensure both comprehensive retrieval and minimization of bias. Conventional manual approaches, although methodologically systematic, are resource-intensive and susceptible to subjectivity, whereas heuristic and automated techniques frequently under-perform in recall unless supplemented by extensive expert input. We introduce a Large Language Model (LLM)-based chained prompt engineering framework for the automated development of search strategies in systematic reviews. The framework replicates the procedural structure of manual search design while leveraging LLMs to decompose review objectives, extract and formalize PICO elements, generate conceptual representations, expand terminologies, and synthesize Boolean queries. In addition to query construction, the framework exhibits superior performance in generating well-structured PICO elements relative to existing methods, thereby strengthening the foundation for high-recall search strategies. Evaluation on a subset of the LEADSInstruct dataset demonstrates that the framework attains a 0.9 average recall. These results significantly exceed the performance of existing approaches. Error analysis further highlights the critical role of precise objective specification and terminological alignment in optimizing retrieval effectiveness. These findings confirm the capacity of LLM-based pipelines to yield transparent, reproducible, and high-performing search strategies, and highlight their potential as scalable instruments for supporting evidence synthesis and evidence-based practice.

239. 【2602.00010】ChunkNorris: A High-Performance and Low-Energy Approach to PDF Parsing and Chunking

链接https://arxiv.org/abs/2602.00010

作者:Mathieu Ciancone,Clovis Varangot-Reille,Marion Schaeffer

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, Language Model, Retrieval-Augmented Generation applications, Information Retrieval part

备注

点击查看摘要

Abstract:In Retrieval-Augmented Generation applications, the Information Retrieval part is central as it provides the contextual information that enables a Large Language Model to generate an appropriate and truthful response. High quality parsing and chunking are critical as efficient data segmentation directly impacts downstream tasks, i.e. Information Retrieval and answer generation. In this paper, we introduce ChunkNorris, a novel heuristic-based technique designed to optimise the parsing and chunking of PDF documents. Our approach does not rely on machine learning and employs a suite of simple yet effective heuristics to achieve high performance with minimal computational overhead. We demonstrate the efficiency of ChunkNorris through a comprehensive benchmark against existing parsing and chunking methods, evaluating criteria such as execution time, energy consumption, and retrieval accuracy. We propose an open-access dataset to produce our results. ChunkNorris outperforms baseline and more advanced techniques, offering a practical and efficient alternative for Information Retrieval tasks. Therefore, this research highlights the potential of heuristic-based methods for real-world, resource-constrained RAG use cases.

240. 【2602.00009】Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA

链接https://arxiv.org/abs/2602.00009

作者:Samuel Thio,Matthew Lewis,Spiros Denaxas,Richard JB Dobson

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Electronic health record, Electronic health, significant cognitive burden, Large Language Models, health record

备注: 26 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While Large Language Models (LLMs) offer transformative potential for data processing, they face significant limitations in clinical settings, particularly regarding context grounding and hallucinations. Current solutions typically isolate retrieval methods focusing either on structured data (SQL/Cypher) or unstructured semantic search but fail to integrate both simultaneously. This work presents MediGRAF (Medical Graph Retrieval Augmented Framework), a novel hybrid Graph RAG system that bridges this gap. By uniquely combining Neo4j Text2Cypher capabilities for structured relationship traversal with vector embeddings for unstructured narrative retrieval, MediGRAF enables natural language querying of the complete patient journey. Using 10 patients from the MIMIC-IV dataset (generating 5,973 nodes and 5,963 relationships), we generated enough nodes and data for patient level question answering (QA), and we evaluated this architecture across varying query complexities. The system demonstrated 100\% recall for factual queries which means all relevant information was retrieved and in the output, while complex inference tasks achieved a mean expert quality score of 4.25/5 with zero safety violations. These results demonstrate that hybrid graph-grounding significantly advances clinical information retrieval, offering a safer, more comprehensive alternative to standard LLM deployments.

241. 【2602.00007】PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering

链接https://arxiv.org/abs/2602.00007

作者:MinGyu Jeon,SuWan Cho,JaeYoung Shu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, complex question answering, Large Language, Language Models, Knowledge Graphs

备注

点击查看摘要

Abstract:Large Language Models (LLMs) augmented with Knowledge Graphs (KGs) have advanced complex question answering, yet they often remain susceptible to failure when their initial high-level reasoning plan is flawed. This limitation, analogous to cognitive functional fixedness, prevents agents from restructuring their approach, leading them to pursue unworkable solutions. To address this, we propose PPoGA (Predictive Plan-on-Graph with Action), a novel KGQA framework inspired by human cognitive control and problem-solving. PPoGA incorporates a Planner-Executor architecture to separate high-level strategy from low-level execution and leverages a Predictive Processing mechanism to anticipate outcomes. The core innovation of our work is a self-correction mechanism that empowers the agent to perform not only Path Correction for local execution errors but also Plan Correction by identifying, discarding, and reformulating the entire plan when it proves ineffective. We conduct extensive experiments on three challenging multi-hop KGQA benchmarks: GrailQA, CWQ, and WebQSP. The results demonstrate that PPoGA achieves state-of-the-art performance, significantly outperforming existing methods. Our work highlights the critical importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems.

242. 【2602.00004】C$^2$-Cite: Contextual-Aware Citation Generation for Attributed Large Language Models

链接https://arxiv.org/abs/2602.00004

作者:Yue Yu,Ting Bai,HengZhi Lan,Li Qian,Li Peng,Jie Wu,Wei Liu,Jian Luan,Chuan Shi

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)

关键词:attribution technique enhances, enabling users, attribution technique, technique enhances, enhances the credibility

备注: WSDM26

点击查看摘要

Abstract:The attribution technique enhances the credibility of LLMs by adding citations to the generated sentences, enabling users to trace back to the original sources and verify the reliability of the output. However, existing instruction-tuned attributed LLMs often fail to properly interpret the contextual semantics of citation symbols (e.g., [i]) during text generation. This shortcoming arises from their insufficient awareness of the context information surrounding citation markers, which in turn leads to disjointed references and poor integration of retrieved knowledge into the generated content. To address this issue, we propose a novel \textbf{C}ontextual-aware \textbf{C}itation generation framework (\textbf{C$^2$}-\textbf{Cite}) that explicitly integrates the semantic relationships between citation markers and their referenced content. Specifically, a contextual citation alignment mechanism is adopted: it first encodes the retrieved document contexts into the symbol representation of citations, then aligns the marker numbers by decoding information from a citation router function. This mechanism enables the transformation of citation markers from generic placeholders into active knowledge pointers that link to the referenced source information. Experimental results on the ALCE benchmark across three datasets validate our framework C$^2$-Cite++: it outperforms the SOTA baseline by an average of 5.8\% in citation quality and 17.4\% in response correctness. The implementation is publicly available at this https URL

243. 【2507.21934】Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

链接https://arxiv.org/abs/2507.21934

作者:Tianyi Hu,Andrea Morales-Garzón,Jingyi Zheng,Maria Maistro,Daniel Hershcovich

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:original dish essence, ensure cultural appropriateness, provide diverse options, dish essence, Retrieval Augmented Generation

备注

点击查看摘要

Abstract:In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish's essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.

244. 【2602.01008】Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages

链接https://arxiv.org/abs/2602.01008

作者:Yang Xiao,Eun-Jung Holden,Ting Dang

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词:Recent speech foundation, automatic speech recognition, remains challenging due, Recent speech, speech foundation models

备注: 13 pages

点击查看摘要

Abstract:Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods like LoRA apply adaptation uniformly across layers, overlooking internal representations thus compromising effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer's role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.

245. 【2602.00197】Rank-and-Reason: Multi-Agent Collaboration Accelerates Zero-Shot Protein Mutation Prediction

链接https://arxiv.org/abs/2602.00197

作者:Yang Tan,Yuyuan Xi,Can Wu,Bozitao Zhong,Mingchen Li,Guisheng Fan,Jiankang Zhu,Yafeng Liang,Nanqing Dong,Liang Hong

类目:Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Zero-shot mutation prediction, low-resource protein engineering, protein language models, existing protein language, yield statistically confident

备注: 22 pages, 5 figures, 15 tables

点击查看摘要

Abstract:Zero-shot mutation prediction is vital for low-resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet-lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank-and-Reason (VenusRAR), a two-stage agentic framework to automate this workflow and maximize expected wet-lab fitness. In the Rank-Stage, a Computational Expert and Virtual Biologist aggregate a context-aware multi-modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason-Stage, an agentic Expert Panel employs chain-of-thought reasoning to audit candidates against geometric and structural constraints, improving the Top-5 Hit Rate by up to 367% on ProteinGym-DMS99. The wet-lab validation on Cas12i3 nuclease further confirms the framework's efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23-fold and 5.05-fold activity improvements. Code and datasets are released on GitHub (this https URL).

信息检索

1. 【2602.02444】RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

链接https://arxiv.org/abs/2602.02444

作者:Tyler Skow,Alexander Martin,Benjamin Van Durme,Rama Chellappa,Reno Kriz

类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

关键词:modern retrieval systems, efficient first-stage retriever, refine results, critical component, component of modern

备注

点击查看摘要

Abstract:Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives, while more efficient.

2. 【2602.02386】rust by Design: Skill Profiles for Transparent, Cost-Aware LLM Routing

链接https://arxiv.org/abs/2602.02386

作者:Mika Okamoto,Ansel Kaplan Erol,Glenn Matlin

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Large Language Model, Large Language, Language Model, wasting money, LLM Selection

备注: Appeared at MLSys YPS 2025

点击查看摘要

Abstract:How should Large Language Model (LLM) practitioners select the right model for a task without wasting money? We introduce BELLA (Budget-Efficient LLM Selection via Automated skill-profiling), a framework that recommends optimal LLM selection for tasks through interpretable skill-based model selection. Standard benchmarks report aggregate metrics that obscure which specific capabilities a task requires and whether a cheaper model could suffice. BELLA addresses this gap through three stages: (1) decomposing LLM outputs and extract the granular skills required by using critic-based profiling, (2) clustering skills into structured capability matrices, and (3) multi-objective optimization to select the right models to maximize performance while respecting budget constraints. BELLA provides natural-language rationale for recommendations, providing transparency that current black-box routing systems lack. We describe the framework architecture, situate it within the landscape of LLM routing and evaluation, and discuss its application to financial reasoning as a representative domain exhibiting diverse skill requirements and cost-variation across models. Our framework enables practitioners to make principled and cost-performance trade-offs for deploying LLMs.

3. 【2602.02343】Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

链接https://arxiv.org/abs/2602.02343

作者:Ziwen Xu,Chenyan Wu,Hengyu Sun,Haiwen Hong,Mengru Wang,Yunzhi Yao,Longtao Huang,Hui Xue,Shumin Deng,Zhixuan Chu,Huajun Chen,Ningyu Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:making comparison difficult, controlling large language, including local weight, local weight fine-tuning, large language models

备注: Work in progress

点击查看摘要

Abstract:Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at this https URL.

4. 【2602.02338】Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs

链接https://arxiv.org/abs/2602.02338

作者:Yu Liang,Zhongjin Zhang,Yuxuan Zhu,Kerui Zhang,Zhiluohan Guo,Wenhang Zhou,Zonqi Yang,Kangle Wu,Yabo Ni,Anxiang Zeng,Cong Fu,Jianxin Wang,Jiazhi Xia

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:existing methods largely, methods largely follow, sequential recommender systems, generic quantization schemes, scaling sequential recommender

备注

点击查看摘要

Abstract:Semantic ID (SID)-based recommendation is a promising paradigm for scaling sequential recommender systems, but existing methods largely follow a semantic-centric pipeline: item embeddings are learned from foundation models and discretized using generic quantization schemes. This design is misaligned with generative recommendation objectives: semantic embeddings are weakly coupled with collaborative prediction, and generic quantization is inefficient at reducing sequential uncertainty for autoregressive modeling. To address these, we propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty. Theoretical analysis and extensive experiments across ten datasets show the effectiveness of ReSID. ReSID consistently outperforms strong sequential and SID-based generative baselines by an average of over 10%, while reducing tokenization cost by up to 122x. Code is available at this https URL.

5. 【2602.02208】owards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study

链接https://arxiv.org/abs/2602.02208

作者:Md. Toufique Hasan,Ayman Asad Khan,Mika Saari,Vaishnavi Bankhele,Pekka Abrahamsson

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)

关键词:English-centric training data, limited real-world evaluation, English-centric training, Large language models, Large language

备注: 6 pages, 2 figures, submitted to MIPRO 2026

点击查看摘要

Abstract:Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.

6. 【2602.02154】Deep learning enables urban change profiling through alignment of historical maps

链接https://arxiv.org/abs/2602.02154

作者:Sidi Wu,Yizi Chen,Maurizio Gribaudi,Konrad Schindler,Clément Mallet,Julien Perret,Lorenz Hurni

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Earth observation technologies, modern Earth observation, Prior to modern, modern Earth, Earth observation

备注: 40 pages

点击查看摘要

Abstract:Prior to modern Earth observation technologies, historical maps provide a unique record of long-term urban transformation and offer a lens on the evolving identity of cities. However, extracting consistent and fine-grained change information from historical map series remains challenging due to spatial misalignment, cartographic variation, and degrading document quality, limiting most analyses to small-scale or qualitative approaches. We propose a fully automated, deep learning-based framework for fine-grained urban change analysis from large collections of historical maps, built on a modular design that integrates dense map alignment, multi-temporal object detection, and change profiling. This framework shifts the analysis of historical maps from ad hoc visual comparison toward systematic, quantitative characterization of urban change. Experiments demonstrate the robust performance of the proposed alignment and object detection methods. Applied to Paris between 1868 and 1937, the framework reveals the spatial and temporal heterogeneity in urban transformation, highlighting its relevance for research in the social sciences and humanities. The modular design of our framework further supports adaptation to diverse cartographic contexts and downstream applications.

7. 【2602.02024】Adaptive Quality-Diversity Trade-offs for Large-Scale Batch Recommendation

链接https://arxiv.org/abs/2602.02024

作者:Clémence Réda(IBENS),Tomas Rigaux,Hiba Bederina(SODA),Koh Takeuchi,Hisashi Kashima,Jill-Jênn Vie(SODA)

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:core research question, comfort zone, core research, research question, question in recommender

备注

点击查看摘要

Abstract:A core research question in recommender systems is to propose batches of highly relevant and diverse items, that is, items personalized to the user's preferences, but which also might get the user out of their comfort zone. This diversity might induce properties of serendipidity and novelty which might increase user engagement or revenue. However, many real-life problems arise in that case: e.g., avoiding to recommend distinct but too similar items to reduce the churn risk, and computational cost for large item libraries, up to millions of items. First, we consider the case when the user feedback model is perfectly observed and known in advance, and introduce an efficient algorithm called B-DivRec combining determinantal point processes and a fuzzy denuding procedure to adjust the degree of item diversity. This helps enforcing a quality-diversity trade-off throughout the user history. Second, we propose an approach to adaptively tailor the quality-diversity trade-off to the user, so that diversity in recommendations can be enhanced if it leads to positive feedback, and vice-versa. Finally, we illustrate the performance and versatility of B-DivRec in the two settings on synthetic and real-life data sets on movie recommendation and drug repurposing.

8. 【2602.01969】Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models

链接https://arxiv.org/abs/2602.01969

作者:Bin Cao,Huixian Lu,Chenwen Ma,Ting Wang,Ruizhe Li,Jing Fan

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:heterogeneous layouts pose, layouts pose persistent, pose persistent challenges, understanding and reasoning, heterogeneous layouts

备注: Work in process

点击查看摘要

Abstract:Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial--semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.

9. 【2602.01865】GRAB: An LLM-Inspired Sequence-First Click-Through Rate Prediction Modeling Paradigm

链接https://arxiv.org/abs/2602.01865

作者:Shaopeng Chen,Chuyue Xie,Huimin Ren,Shaozong Zhang,Han Zhang,Ruobing Cheng,Zhiqiang Cao,Zehao Ju,Gao Yu,Jie Ding,Xiaodong Chen,Xuewu Jiao,Shuanglong Li,Liu Lin

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Traditional Deep Learning, Deep Learning Recommendation, Learning Recommendation Models, Traditional Deep, face increasing bottlenecks

备注

点击查看摘要

Abstract:Traditional Deep Learning Recommendation Models (DLRMs) face increasing bottlenecks in performance and efficiency, often struggling with generalization and long-sequence modeling. Inspired by the scaling success of Large Language Models (LLMs), we propose Generative Ranking for Ads at Baidu (GRAB), an end-to-end generative framework for Click-Through Rate (CTR) prediction. GRAB integrates a novel Causal Action-aware Multi-channel Attention (CamA) mechanism to effectively capture temporal dynamics and specific action signals within user behavior sequences. Full-scale online deployment demonstrates that GRAB significantly outperforms established DLRMs, delivering a 3.05% increase in revenue and a 3.49% rise in CTR. Furthermore, the model demonstrates desirable scaling behavior: its expressive power shows a monotonic and approximately linear improvement as longer interaction sequences are utilized.

10. 【2602.01712】Mapping a Decade of Avian Influenza Research (2014-2023): A Scientometric Analysis from Web of Science

链接https://arxiv.org/abs/2602.01712

作者:Muneer Ahmad,Undie Felicia Nkatv,Amrita Sharma,Gorrety Maria Juma,Nicholas Kamoga,Julirine Nakanwag

类目:Digital Libraries (cs.DL); Databases (cs.DB); Information Retrieval (cs.IR)

关键词:Avian Influenza research, analyzes Avian Influenza, scientometric study analyzes, Web of Science, study analyzes Avian

备注: 24 pages, 7 figures, Research Article

点击查看摘要

Abstract:This scientometric study analyzes Avian Influenza research from 2014 to 2023 using bibliographic data from the Web of Science database. We examined publication trends, sources, authorship, collaborative networks, document types, and geographical distribution to gain insights into the global research landscape. Results reveal a steady increase in publications, with high contributions from Chinese and American institutions. Journals such as PLoS One and the Journal of Virology published the highest number of studies, indicating their influence in this field. The most prolific institutions include the Chinese Academy of Sciences and the University of Hong Kong, while the College of Veterinary Medicine at South China Agricultural University emerged as the most productive department. China and the USA lead in publication volume, though developed nations like the United Kingdom and Germany exhibit a higher rate of international collaboration. "Articles" are the most common document type, constituting 84.6% of the total, while "Reviews" account for 7.6%. This study provides a comprehensive view of global trends in Avian Influenza research, emphasizing the need for collaborative efforts across borders.

11. 【2602.01686】Unmediated AI-Assisted Scholarly Citations

链接https://arxiv.org/abs/2602.01686

作者:Stefan Szeider

类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:manually copy citation, Traditional bibliography databases, databases require users, navigate search forms, Traditional bibliography

备注

点击查看摘要

Abstract:Traditional bibliography databases require users to navigate search forms and manually copy citation data. Language models offer an alternative: a natural-language interface where researchers write text with informal citation fragments, which are automatically resolved to proper references. However, language models are not reliable for scholarly work as they generate fabricated (hallucinated) citations at substantial rates. We present an architectural approach that combines the natural-language interface of LLM chatbots with the accuracy of direct database access, implemented through the Model Context Protocol. Our system enables language models to search bibliographic databases, perform fuzzy matching, and export verified entries, all through conversational interaction. A key architectural principle bypasses the language model during final data export: entries are fetched directly from authoritative sources, with timeout protection, to guarantee accuracy. We demonstrate this approach with MCP-DBLP, a server providing access to the DBLP computer science bibliography. The system transforms form-based bibliographic services into conversational assistants that maintain scholarly integrity. This architecture is adaptable to other bibliographic databases and academic data sources.

Subjects:

Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.01686 [cs.DL]

(or
arXiv:2602.01686v1 [cs.DL] for this version)

https://doi.org/10.48550/arXiv.2602.01686

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Journalreference:
Open Conference Proceedings, Vol. 8 (2026): The Second Bridge on Artificial Intelligence for Scholarly Communication (AAAI-26)

Related DOI:

https://doi.org/10.52825/ocp.v8i.3161

Focus to learn more

            DOI(s) linking to related resources</p>
12. 【2602.01572】LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

链接https://arxiv.org/abs/2602.01572

作者:Yeqin Zhang,Yunfei Wang,Jiaxuan Chen,Ke Qin,Yizheng Zhao,Cam-Tu Nguyen

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Natural Language Processing, Language Processing, Natural Language, Large Language Models, leverage Large Language

备注

点击查看摘要

Abstract:Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.

13. 【2602.01450】he Algorithmic Self-Portrait: Deconstructing Memory in ChatGPT

链接https://arxiv.org/abs/2602.01450

作者:Abhisek Dash,Soumi Das,Elisabeth Kirsten,Qinyuan Wu,Sai Keerthana Karnam,Krishna P. Gummadi,Thorsten Holz,Muhammad Bilal Zafar,Savvas Zannettou

类目:Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR)

关键词:personalized and context-aware, Algorithmic Self-portrait, context-aware interactions, Memory, enable personalized

备注: This paper has been accepted at The ACM Web Conference 2026

点击查看摘要

Abstract:To enable personalized and context-aware interactions, conversational AI systems have introduced a new mechanism: Memory. Memory creates what we refer to as the Algorithmic Self-portrait - a new form of personalization derived from users' self-disclosed information divulged within private conversations. While memory enables more coherent exchanges, the underlying processes of memory creation remain opaque, raising critical questions about data sensitivity, user agency, and the fidelity of the resulting portrait. To bridge this research gap, we analyze 2,050 memory entries from 80 real-world ChatGPT users. Our analyses reveal three key findings: (1) A striking 96% of memories in our dataset are created unilaterally by the conversational system, potentially shifting agency away from the user; (2) Memories, in our dataset, contain a rich mix of GDPR-defined personal data (in 28% memories) along with psychological insights about participants (in 52% memories); and (3)~A significant majority of the memories (84%) are directly grounded in user context, indicating faithful representation of the conversations. Finally, we introduce a framework-Attribution Shield-that anticipates these inferences, alerts about potentially sensitive memory inferences, and suggests query reformulations to protect personal information without sacrificing utility.

Comments:
This paper has been accepted at The ACM Web Conference 2026

Subjects:

Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.01450 [cs.HC]

(or
arXiv:2602.01450v1 [cs.HC] for this version)

https://doi.org/10.48550/arXiv.2602.01450

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
14. 【2602.01246】PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian

链接https://arxiv.org/abs/2602.01246

作者:Jamshid Mozafari,Seyed Parsa Mousavinasab,Adam Jatowt

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Reasoning-focused Question Answering, languages remain scarce, Large Language Models, Large Language, low-resource languages remain

备注: Submitted to SIGIR 2026

点击查看摘要

Abstract:Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.

15. 【2602.01239】Inferential Question Answering

链接https://arxiv.org/abs/2602.01239

作者:Jamshid Mozafari,Hamed Zamani,Guido Zuccon,Adam Jatowt

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:existing work focuses, extensive research, wide range, existing work, work focuses

备注: Proceedings of the ACM Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment-i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.

16. 【2602.01023】Unifying Ranking and Generation in Query Auto-Completion via Retrieval-Augmented Generation and Multi-Objective Alignment

链接https://arxiv.org/abs/2602.01023

作者:Kai Yuan(1),Anthony Zheng(1),Jia Hu(1),Divyanshu Sheth(1),Hemanth Velaga(1),Kylee Kim(1),Matteo Guarrera(2),Besim Avci(2),Xuetao Yin(1),Rajyashree Mukherjee(1),Sean Suchter(1) ((1) Apple, (2) UC Berkeley)

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:suggests query completions, Query Auto-Completion, suggests query, query completions, users type

备注: 11 pages, 4 figures

点击查看摘要

Abstract:Query Auto-Completion (QAC) suggests query completions as users type, helping them articulate intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and require extensive feature engineering, while recent generative methods suffer from hallucination and safety risks. We present a unified framework that reformulates QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) reformulating QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC, and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision for high-quality synthetic data; (3) a hybrid serving architecture enabling efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform demonstrates substantial improvements: offline metrics show gains across all dimensions, human evaluation yields +0.40 to +0.69 preference scores, and a controlled online experiment achieves 5.44\% reduction in keystrokes and 3.46\% increase in suggestion adoption, validating that unified generation with RAG and multi-objective alignment provides an effective solution for production QAC. This work represents a paradigm shift to end-to-end generation powered by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the broader search and recommendation industry.

17. 【2602.00899】Domain-Adaptive and Scalable Dense Retrieval for Content-Based Recommendation

链接https://arxiv.org/abs/2602.00899

作者:Mritunjay Pandey(Aditya Birla Group)

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:sparse keyword matching, search commonly rely, limited lexical overlap, E-commerce recommendation, Negatives Ranking Loss

备注: 13 pages, 4 figures. Semantic dense retrieval for content-based recommendation on Amazon Reviews 2023 (Category - Fashion). Dataset statistics: 2.0M users; 825.9K items; 2.5M ratings; 94.9M review tokens; 510.5M metadata tokens. Timespan: May 1996 to September 2023. Metadata includes: user reviews (ratings, text, helpfulness votes, etc.); item metadata (descriptions, price, raw images, etc.)

点击查看摘要

Abstract:E-commerce recommendation and search commonly rely on sparse keyword matching (e.g., BM25), which breaks down under vocabulary mismatch when user intent has limited lexical overlap with product metadata. We cast content-based recommendation as recommendation-as-retrieval: given a natural-language intent signal (a query or review), retrieve the top-K most relevant items from a large catalog via semantic similarity. We present a scalable dense retrieval system based on a two-tower bi-encoder, fine-tuned on the Amazon Reviews 2023 (Fashion) subset using supervised contrastive learning with Multiple Negatives Ranking Loss. We construct training pairs from review text (as a query proxy) and item metadata (as the positive document) and fine-tune on 50,000 sampled interactions with a maximum sequence length of 500 tokens. For efficient serving, we combine FAISS HNSW indexing with an ONNX Runtime inference pipeline using INT8 dynamic quantization. On a review-to-title benchmark over 826,402 catalog items, our approach improves Recall@10 from 0.26 (BM25) to 0.66, while meeting practical latency and model-size constraints: 6.1 ms median CPU inference latency (batch size 1) and a 4x reduction in model size. Overall, we provide an end-to-end, reproducible blueprint for taking domain-adapted dense retrieval from offline training to CPU-efficient serving at catalog scale.

Comments:
13 pages, 4 figures. Semantic dense retrieval for content-based recommendation on Amazon Reviews 2023 (Category - Fashion). Dataset statistics: 2.0M users; 825.9K items; 2.5M ratings; 94.9M review tokens; 510.5M metadata tokens. Timespan: May 1996 to September 2023. Metadata includes: user reviews (ratings, text, helpfulness votes, etc.); item metadata (descriptions, price, raw images, etc.)

Subjects:

Machine Learning (cs.LG); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.00899 [cs.LG]

(or
arXiv:2602.00899v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2602.00899

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Mritunjay Pandey [view email] [v1]
Sat, 31 Jan 2026 20:58:23 UTC (416 KB)

18. 【2602.00857】Unifying Adversarial Robustness and Training Across Text Scoring Models

链接https://arxiv.org/abs/2602.00857

作者:Manveer Singh Tamber,Hosna Oyarhoseini,Jimmy Lin

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:obscuring shared vulnerabilities, obscuring shared, shared vulnerabilities, fragmented across applications, text scoring

备注

点击查看摘要

Abstract:Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.

19. 【2602.00805】Optimizing Retrieval Components for a Shared Backbone via Component-Wise Multi-Stage Training

链接https://arxiv.org/abs/2602.00805

作者:Yunhan Li,Mingjie Xie,Zihan Gong,Zeyang Shi,Gengshen Wu,Min Yang

类目:Information Retrieval (cs.IR)

关键词:Recent advances, multiple downstream applications, enabled dense retrievers, advances in embedding-based, serve as core

备注: 4 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Recent advances in embedding-based retrieval have enabled dense retrievers to serve as core infrastructure in many industrial systems, where a single retrieval backbone is often shared across multiple downstream applications. In such settings, retrieval quality directly constrains system performance and extensibility, while coupling model selection, deployment, and rollback decisions across applications. In this paper, we present empirical findings and a system-level solution for optimizing retrieval components deployed as a shared backbone in production legal retrieval systems. We adopt a multi-stage optimization framework for dense retrievers and rerankers, and show that different retrieval components exhibit stage-dependent trade-offs. These observations motivate a component-wise, mixed-stage configuration rather than relying on a single uniformly optimal checkpoint. The resulting backbone is validated through end-to-end evaluation and deployed as a shared retrieval service supporting multiple industrial applications.

Comments:
4 pages, 3 figures, 3 tables

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2602.00805 [cs.IR]

(or
arXiv:2602.00805v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.00805

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
20. 【2602.00793】SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality

链接https://arxiv.org/abs/2602.00793

作者:Yoonsang Kim,Devshree Jadeja,Divyansh Pradhan,Yalong Yang,Arie Kaufman

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)

关键词:day creates unnecessary, Speaking aloud, creates unnecessary effort, wearable AR assistant, requests every day

备注: 11 pages, 9 figures. This is the author's version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR) 2026

点击查看摘要

Abstract:Speaking aloud to a wearable AR assistant in public can be socially awkward, and re-articulating the same requests every day creates unnecessary effort. We present SpeechLess, a wearable AR assistant that introduces a speech-based intent granularity control paradigm grounded in personalized spatial memory. SpeechLess helps users "speak less," while still obtaining the information they need, and supports gradual explicitation of intent when more complex expression is required. SpeechLess binds prior interactions to multimodal personal context-space, time, activity, and referents-to form spatial memories, and leverages them to extrapolate missing intent dimensions from under-specified user queries. This enables users to dynamically adjust how explicitly they express their informational needs, from full-utterance to micro/zero-utterance interaction. We motivate our design through a week-long formative study using a commercial smart glasses platform, revealing discomfort with public voice use, frustration with repetitive speech, and hardware constraints. Building on these insights, we design SpeechLess, and evaluate it through controlled lab and in-the-wild studies. Our results indicate that regulated speech-based interaction, can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.

21. 【2602.00758】mporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting

链接https://arxiv.org/abs/2602.00758

作者:Ali El Lahib,Ying-Jieh Xia,Zehan Li,Yuxuan Wang,Xinyu Pi

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Search-engine date filters, Search-engine date, enforce pre-cutoff retrieval, search-augmented forecasters, enforce pre-cutoff

备注: 9 pages, 6 figures

点击查看摘要

Abstract:Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show this approach is unreliable: auditing Google Search with a before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and for 41%, at least one page directly reveals the answer. Using a large language model (LLM), gpt-oss-120b, to forecast with these leaky documents, we demonstrate an inflated prediction accuracy (Brier score 0.108 vs. 0.242 with leak-free documents). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata/timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.

22. 【2602.00730】owards Trustworthy Multimodal Recommendation

链接https://arxiv.org/abs/2602.00730

作者:Zixuan Li

类目:Information Retrieval (cs.IR)

关键词:Recent advances, demonstrated the effectiveness, effectiveness of incorporating, incorporating visual, visual and textual

备注: Preprint, 10 pages, 5 figures

点击查看摘要

Abstract:Recent advances in multimodal recommendation have demonstrated the effectiveness of incorporating visual and textual content into collaborative filtering. However, real-world deployments raise an increasingly important yet underexplored issue: trustworthiness. On modern e-commerce platforms, multimodal content can be misleading or unreliable (e.g., visually inconsistent product images or click-bait titles), injecting untrustworthy signals into multimodal representations and making existing recommenders brittle under modality corruption. In this work, we take a step towards trustworthy multimodal recommendation from both a method and an analysis perspective. First, we propose a plug-and-play modality-level rectification component that mitigates untrustworthy modality features by learning soft correspondences between items and multimodal features. Using lightweight projections and Sinkhorn-based soft matching, the rectification suppresses mismatched modality signals while preserving semantic consistency, and can be integrated into existing multimodal recommenders without architectural modifications. Second, we present two practical insights on interaction-level trustworthiness under noisy collaborative signals: (i) training-set pseudo interactions can help or hurt performance under noise depending on prior-signal alignment; and (ii) propagation-graph pseudo edges can also help or hurt robustness, as message passing may amplify misalignment. Extensive experiments on multiple datasets and backbones under varying corruption levels demonstrate improved robustness from modality rectification and validate the above interaction-level observations.

23. 【2602.00727】SWGCN: Synergy Weighted Graph Convolutional Network for Multi-Behavior Recommendation

链接https://arxiv.org/abs/2602.00727

作者:Fangda Chen,Yueyang Wang,Chaoli Lou,Min Gao,Qingyu Xiong

类目:Information Retrieval (cs.IR)

关键词:Multi-behavior recommendation paradigms, forecasting primary conversions, Multi-behavior recommendation, diverse user activities, capture diverse user

备注: Accepted by Information Sciences

点击查看摘要

Abstract:Multi-behavior recommendation paradigms have emerged to capture diverse user activities, forecasting primary conversions (e.g., purchases) by leveraging secondary signals like browsing history. However, current graph-based methods often overlook cross-behavioral synergistic signals and fine-grained intensity of individual actions. Motivated by the need to overcome these shortcomings, we introduce Synergy Weighted Graph Convolutional Network (SWGCN). SWGCN introduces two novel components: a Target Preference Weigher, which adaptively assigns weights to user-item interactions within each behavior, and a Synergy Alignment Task, which guides its training by leveraging an Auxiliary Preference Valuator. This task prioritizes interactions from synergistic signals that more accurately reflect user preferences. The performance of our model is rigorously evaluated through comprehensive tests on three open-source datasets, specifically Taobao, IJCAI, and Beibei. On the Taobao dataset, SWGCN yields relative gains of 112.49% and 156.36% in terms of Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), respectively. It also yields consistent gains on IJCAI and Beibei, confirming its robustness and generalizability across various datasets. Our implementation is open-sourced and can be accessed via this https URL.

24. 【2602.00699】From Prompt to Graph: Comparing LLM-Based Information Extraction Strategies in Domain-Specific Ontology Development

链接https://arxiv.org/abs/2602.00699

作者:Xuan Liu,Ziyu Li,Mu He,Ziyang Ma,Xiaoxu Wu,Gizem Yilmaz,Yiyuan Xia,Bingbing Li,He Tan,Jerry Ying Hsi Fuh,Wen Feng Lu,Anders E.W. Jarfors,Per Jansson

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Ontologies are essential, structuring domain knowledge, improving accessibility, Large Language Models, essential for structuring

备注: 11 pages,8 figures,3 tables,presented at International Conference on Industry of the Future and Smart Manufacturing,2025

点击查看摘要

Abstract:Ontologies are essential for structuring domain knowledge, improving accessibility, sharing, and reuse. However, traditional ontology construction relies on manual annotation and conventional natural language processing (NLP) techniques, making the process labour-intensive and costly, especially in specialised fields like casting manufacturing. The rise of Large Language Models (LLMs) offers new possibilities for automating knowledge extraction. This study investigates three LLM-based approaches, including pre-trained LLM-driven method, in-context learning (ICL) method and fine-tuning method to extract terms and relations from domain-specific texts using limited data. We compare their performances and use the best-performing method to build a casting ontology that validated by domian expert.

25. 【2602.00682】RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment

链接https://arxiv.org/abs/2602.00682

作者:Yuecheng Li,Hengwei Ju,Zeyu Song,Wei Yang,Chi Lu,Peng Jiang,Kun Gai

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:accurate user preferences, systems typically integrates, typically integrates user, integrates user behavior, typically integrates

备注: Under Review

点击查看摘要

Abstract:Multimodal recommendation systems typically integrates user behavior with multimodal data from items, thereby capturing more accurate user preferences. Concurrently, with the rise of large models (LMs), multimodal recommendation is increasingly leveraging their strengths in semantic understanding and contextual reasoning. However, LM representations are inherently optimized for general semantic tasks, while recommendation models rely heavily on sparse user/item unique identity (ID) features. Existing works overlook the fundamental representational divergence between large models and recommendation systems, resulting in incompatible multimodal representations and suboptimal recommendation performance. To bridge this gap, we propose RecGOAT, a novel yet simple dual semantic alignment framework for LLM-enhanced multimodal recommendation, which offers theoretically guaranteed alignment capability. RecGOAT first employs graph attention networks to enrich collaborative semantics by modeling item-item, user-item, and user-user relationships, leveraging user/item LM representations and interaction history. Furthermore, we design a dual-granularity progressive multimodality-ID alignment framework, which achieves instance-level and distribution-level semantic alignment via cross-modal contrastive learning (CMCL) and optimal adaptive transport (OAT), respectively. Theoretically, we demonstrate that the unified representations derived from our alignment framework exhibit superior semantic consistency and comprehensiveness. Extensive experiments on three public benchmarks show that our RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights. Additionally, the deployment on a large-scale online advertising platform confirms the model's effectiveness and scalability in industrial recommendation scenarios. Code available at this https URL.

26. 【2602.00681】Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation

链接https://arxiv.org/abs/2602.00681

作者:Ilyass Moummad,Marius Miron,Lukas Rauch,David Robinson,Alexis Joly,Olivier Pietquin,Emmanuel Chemla,Matthieu Geist

类目:ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:learning aligned audio-image, offers an interpretable, interpretable alternative, alternative to audio-only, audio-only classification

备注

点击查看摘要

Abstract:Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.

27. 【2602.00632】owards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation

链接https://arxiv.org/abs/2602.00632

作者:Hongxun Ding,Keqin Bao,Jizhi Zhang,Yi Fang,Wenxin Xu,Fuli Feng,Xiangnan He

类目:Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, Large Language, promise in Large, enhancing recommendation quality

备注

点击查看摘要

Abstract:While Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure to directly leverage its underlying mechanism: Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency-where most actions fail to provide learning signals-and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER is designed to transform non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at this https URL.

28. 【2602.00495】Equity vs. Equality: Optimizing Ranking Fairness for Tailored Provider Needs

链接https://arxiv.org/abs/2602.00495

作者:Yiteng Tu,Weihang Su,Shuguang Han,Yiqun Liu,Qingyao Ai

类目:Information Retrieval (cs.IR)

关键词:Information Retrieval, important challenge, plays a central, central role, role in connecting

备注

点击查看摘要

Abstract:Ranking plays a central role in connecting users and providers in Information Retrieval (IR) systems, making provider-side fairness an important challenge. While recent research has begun to address fairness in ranking, most existing approaches adopt an equality-based perspective, aiming to ensure that providers with similar content receive similar exposure. However, it overlooks the diverse needs of real-world providers, whose utility from ranking may depend not only on exposure but also on outcomes like sales or engagement. Consequently, exposure-based fairness may not accurately capture the true utility perceived by different providers with varying priorities. To this end, we introduce an equity-oriented fairness framework that explicitly models each provider's preferences over key outcomes such as exposure and sales, thus evaluating whether a ranking algorithm can fulfill these individualized goals while maintaining overall fairness across providers. Based on this framework, we develop EquityRank, a gradient-based algorithm that jointly optimizes user-side effectiveness and provider-side equity. Extensive offline and online simulations demonstrate that EquityRank offers improved trade-offs between effectiveness and fairness and adapts to heterogeneous provider needs.

29. 【2602.00296】RAGRouter-Bench: A Dataset and Benchmark for Adaptive RAG Routing

链接https://arxiv.org/abs/2602.00296

作者:Ziqi Wang,Xi Zhu,Shuhang Lin,Haochen Xue,Minghao Guo,Yongfeng Zhang

类目:Information Retrieval (cs.IR)

关键词:grounding large language, large language models, external knowledge, grounding large, large language

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a core paradigm for grounding large language models with external knowledge. Despite extensive efforts exploring diverse retrieval strategies, existing studies predominantly focus on query-side complexity or isolated method improvements, lacking a systematic understanding of how RAG paradigms behave across different query-corpus contexts and effectiveness-efficiency trade-offs. In this work, we introduce RAGRouter-Bench, the first dataset and benchmark designed for adaptive RAG routing. RAGRouter-Bench revisits retrieval from a query-corpus compatibility perspective and standardizes five representative RAG paradigms for systematic evaluation across 7,727 queries and 21,460 documents spanning diverse domains. The benchmark incorporates three canonical query types together with fine-grained semantic and structural corpus metrics, as well as a unified evaluation for both generation quality and resource consumption. Experiments with DeepSeek-V3 and LLaMA-3.1-8B demonstrate that no single RAG paradigm is universally optimal, that paradigm applicability is strongly shaped by query-corpus interactions, and that increased advanced mechanism does not necessarily yield better effectiveness-efficiency trade-offs. These findings underscore the necessity of routing-aware evaluation and establish a foundation for adaptive, interpretable, and generalizable next-generation RAG systems.

30. 【2602.00160】First Steps, Lasting Impact: Platform-Aware Forensics for the Next Generation of Analysts

链接https://arxiv.org/abs/2602.00160

作者:Vinayak Jain,Sneha Sudhakaran,Saranyan Senthivel

类目:Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

关键词:Linux, file system structures, due to inherent, strongly influenced, inherent variations

备注: 21st International Conference on Cyber Warfare and Security (ICCWS 2026)

点击查看摘要

Abstract:The reliability of cyber forensic evidence acquisition is strongly influenced by the underlying operating systems, Windows, macOS, and Linux - due to inherent variations in file system structures, encryption protocols, and forensic tool compatibility. Disk forensics, one of the most widely used techniques in digital investigations, faces distinct obstacles on each platform. Windows, with its predominantly NTFS and FAT file systems, typically supports reliable disk imaging and analysis through established tools such as FTK Imager and Autopsy/Sleuth Kit. However, encryption features frequently pose challenges to evidence acquisition. Conversely, Linux environments, which rely on file systems like ext4 and XFS, generally offer greater transparency, yet the transient nature of log retention often complicates forensic analysis. In instances where anti-forensic strategies such as encryption and compression render traditional disk forensics insufficient, memory forensics becomes crucial. While memory forensic methodologies demonstrate robustness across Windows and Linux platforms forms through frameworks like Volatility, platform-specific difficulties persist. Memory analysis on Linux systems benefits from tools like LiME, snapshot utilities, and dd for memory acquisition; nevertheless, live memory acquisition on Linux can still present challenges. This research systematically assesses both disk and memory forensic acquisition techniques across samples representing Windows and Linux systems. By identifying effective combinations of forensic tools and configurations tailored to each operating system, the study aims to improve the accuracy and reliability of evidence collection. It further evaluates current forensic tools and highlights a persistent gap: consistently assuring forensic input reliability and footprint integrity.

31. 【2602.00083】SPARC-RAG: Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation

链接https://arxiv.org/abs/2602.00083

作者:Yuxin Yang,Gangda Deng,Ömer Faruk Akgül,Nima Chitsazan,Yash Govilkar,Akasha Tigalappanavara,Shi-Xiong Zhang,Sambit Sahu,Viktor Prasanna

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:grounds large language, requires long reasoning, large language model, language model outputs, Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) grounds large language model outputs in external evidence, but remains challenged on multi-hop question answering that requires long reasoning. Recent works scale RAG at inference time along two complementary dimensions: sequential depth for iterative refinement and parallel width for coverage expansion. However, naive scaling causes context contamination and scaling inefficiency, leading to diminishing or negative returns despite increased computation. To address these limitations, we propose SPARC-RAG, a multi-agent framework that coordinates sequential and parallel inference-time scaling under a unified context management mechanism. SPARC-RAG employs specialized agents that maintain a shared global context and provide explicit control over the scaling process. It generates targeted, complementary sub-queries for each branch to enable diverse parallel exploration, and explicitly regulates exiting decisions based on answer correctness and evidence grounding. To optimize scaling behavior, we further introduce a lightweight fine-tuning method with process-level verifiable preferences, which improves the efficiency of sequential scaling and effectiveness of parallel scaling. Across single- and multi-hop QA benchmarks, SPARC-RAG consistently outperforms previous RAG baselines, yielding an average +6.2 F1 improvement under lower inference cost.

32. 【2602.00052】AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

链接https://arxiv.org/abs/2602.00052

作者:Ramtin Babaeipour,François Charest,Madison Wright

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:knowledge management create, management create significant, create significant burden, Increasing clinical trial, Increasing clinical

备注

点击查看摘要

Abstract:Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI-assistance on simulated extraction CRC workflows. Our RAG process was measured as more accurate (87.8%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In the simulated extraction workflows, AI-assisted tasks were completed 40% faster, rated as less cognitively demanding and strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real world clinical workflows to further validate its impact on feasibility, study start-up, and post-activation monitoring.

33. 【2602.00013】Linear-PAL: A Lightweight Ranker for Mitigating Shortcut Learning in Personalized, High-Bias Tabular Ranking

链接https://arxiv.org/abs/2602.00013

作者:Vipul Dinesh Pawar

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Position Bias, implicit user feedback, confounded by Position, implicit user, Position-bias Aware Learning

备注

点击查看摘要

Abstract:In e-commerce ranking, implicit user feedback is systematically confounded by Position Bias -- the strong propensity of users to interact with top-ranked items regardless of relevance. While Deep Learning architectures (e.g., Two-Tower Networks) are the standard solution for de-biasing, we demonstrate that in High-Bias Regimes, state-of-the-art Deep Ensembles suffer from Shortcut Learning: they minimize training loss by overfitting to the rank signal, leading to degraded ranking quality despite high prediction accuracy. We propose Linear Position-bias Aware Learning (Linear-PAL), a lightweight framework that enforces de-biasing through structural constraints: explicit feature conjunctions and aggressive regularization. We further introduce a Vectorized Integer Hashing technique for feature generation, replacing string-based operations with $O(N)$ vectorized arithmetic. Evaluating on a large-scale dataset (4.2M samples), Linear-PAL achieves Pareto Dominance: it outperforms Deep Ensembles in de-biased ranking quality (Relevance AUC: 0.7626 vs. 0.6736) while reducing training latency by 43x (40s vs 1762s). This computational efficiency enables high-frequency retraining, allowing the system to capture user-specific emerging market trends and deliver robust, personalized ranking in near real-time.

34. 【2602.00012】OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models

链接https://arxiv.org/abs/2602.00012

作者:Michael Siebenmann(1),Javier Argota Sánchez-Vaquerizo(1),Stefan Arisona(2),Krystian Samp(2),Luis Gisler(2),Dirk Helbing(1 and 3) ((1) Professorship of Computational Social Science, ETH Zurich, (2) Esri Ramp;D Center Zurich, (3) Complexity Science Hub, Vienna)

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)

关键词:Large Language Models, Language Models, geospatial Open Government, Large Language, reproducible framework based

备注: This work has been submitted to the IEEE for possible publication. 7 pages, 6 figures

点击查看摘要

Abstract:We present OGD4All, a transparent, auditable, and reproducible framework based on Large Language Models (LLMs) to enhance citizens' interaction with geospatial Open Government Data (OGD). The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs. Evaluated on a 199-question benchmark covering both factual and unanswerable questions, across 430 City-of-Zurich datasets and 11 LLMs, OGD4All reaches 98% analytical correctness and 94% recall while reliably rejecting questions unsupported by available data, which minimizes hallucination risks. Statistical robustness tests, as well as expert feedback, show reliability and social relevance. The proposed approach shows how LLMs can provide explainable, multimodal access to public data, advancing trustworthy AI for open governance.

35. 【2602.00011】Chained Prompting for Better Systematic Review Search Strategies

链接https://arxiv.org/abs/2602.00011

作者:Fatima Nasser,Fouad Trad,Ammar Mohanna,Ghada El-Hajj Fuleihan,Ali Chehab

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:rigorously designed search, Systematic reviews require, Large Language Model, designed search strategies, minimization of bias

备注: Accepted in the 3rd International Conference on Foundation and Large Language Models (FLLM2025)

点击查看摘要

Abstract:Systematic reviews require the use of rigorously designed search strategies to ensure both comprehensive retrieval and minimization of bias. Conventional manual approaches, although methodologically systematic, are resource-intensive and susceptible to subjectivity, whereas heuristic and automated techniques frequently under-perform in recall unless supplemented by extensive expert input. We introduce a Large Language Model (LLM)-based chained prompt engineering framework for the automated development of search strategies in systematic reviews. The framework replicates the procedural structure of manual search design while leveraging LLMs to decompose review objectives, extract and formalize PICO elements, generate conceptual representations, expand terminologies, and synthesize Boolean queries. In addition to query construction, the framework exhibits superior performance in generating well-structured PICO elements relative to existing methods, thereby strengthening the foundation for high-recall search strategies. Evaluation on a subset of the LEADSInstruct dataset demonstrates that the framework attains a 0.9 average recall. These results significantly exceed the performance of existing approaches. Error analysis further highlights the critical role of precise objective specification and terminological alignment in optimizing retrieval effectiveness. These findings confirm the capacity of LLM-based pipelines to yield transparent, reproducible, and high-performing search strategies, and highlight their potential as scalable instruments for supporting evidence synthesis and evidence-based practice.

36. 【2602.00010】ChunkNorris: A High-Performance and Low-Energy Approach to PDF Parsing and Chunking

链接https://arxiv.org/abs/2602.00010

作者:Mathieu Ciancone,Clovis Varangot-Reille,Marion Schaeffer

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, Language Model, Retrieval-Augmented Generation applications, Information Retrieval part

备注

点击查看摘要

Abstract:In Retrieval-Augmented Generation applications, the Information Retrieval part is central as it provides the contextual information that enables a Large Language Model to generate an appropriate and truthful response. High quality parsing and chunking are critical as efficient data segmentation directly impacts downstream tasks, i.e. Information Retrieval and answer generation. In this paper, we introduce ChunkNorris, a novel heuristic-based technique designed to optimise the parsing and chunking of PDF documents. Our approach does not rely on machine learning and employs a suite of simple yet effective heuristics to achieve high performance with minimal computational overhead. We demonstrate the efficiency of ChunkNorris through a comprehensive benchmark against existing parsing and chunking methods, evaluating criteria such as execution time, energy consumption, and retrieval accuracy. We propose an open-access dataset to produce our results. ChunkNorris outperforms baseline and more advanced techniques, offering a practical and efficient alternative for Information Retrieval tasks. Therefore, this research highlights the potential of heuristic-based methods for real-world, resource-constrained RAG use cases.

37. 【2602.00009】Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA

链接https://arxiv.org/abs/2602.00009

作者:Samuel Thio,Matthew Lewis,Spiros Denaxas,Richard JB Dobson

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Electronic health record, Electronic health, significant cognitive burden, Large Language Models, health record

备注: 26 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While Large Language Models (LLMs) offer transformative potential for data processing, they face significant limitations in clinical settings, particularly regarding context grounding and hallucinations. Current solutions typically isolate retrieval methods focusing either on structured data (SQL/Cypher) or unstructured semantic search but fail to integrate both simultaneously. This work presents MediGRAF (Medical Graph Retrieval Augmented Framework), a novel hybrid Graph RAG system that bridges this gap. By uniquely combining Neo4j Text2Cypher capabilities for structured relationship traversal with vector embeddings for unstructured narrative retrieval, MediGRAF enables natural language querying of the complete patient journey. Using 10 patients from the MIMIC-IV dataset (generating 5,973 nodes and 5,963 relationships), we generated enough nodes and data for patient level question answering (QA), and we evaluated this architecture across varying query complexities. The system demonstrated 100\% recall for factual queries which means all relevant information was retrieved and in the output, while complex inference tasks achieved a mean expert quality score of 4.25/5 with zero safety violations. These results demonstrate that hybrid graph-grounding significantly advances clinical information retrieval, offering a safer, more comprehensive alternative to standard LLM deployments.

38. 【2602.00008】Front-Loaded or Balanced? The Mechanism through Which Review Order Affects Overall Ratings in Premium Service Settings

链接https://arxiv.org/abs/2602.00008

作者:He Wang,Ziyu Zhou,Hanxiang Liu

类目:Information Retrieval (cs.IR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)

关键词:increasingly prevalent landscape, critical factor shaping, factor shaping rating, high-quality service contexts, increasingly prevalent

备注

点击查看摘要

Abstract:In the increasingly prevalent landscape of high-quality service contexts, whether consumer evaluation interfaces adopt a rating-first or review-first sequence has become a critical factor shaping rating authenticity and feedback quality. While prior research has primarily examined review content and sentiment, systematic investigation into how evaluation order influences rating outcomes remains limited. Through exploratory analyses, we find that Letterboxd -- which employs a review-first, rating-after mechanism -- exhibits a more centralized rating distribution with fewer extreme scores, whereas Yelp -- which adopts a rating-first, review-after mechanism -- shows a pronounced bimodal distribution with more polarized ratings. Three controlled experiments further demonstrate that in high-quality service contexts, a rating-first (vs. review-first) interface significantly elevates consumers' overall ratings. Mechanism analyses indicate that cognitive effort and affective heuristics serve as dual pathways: a rating-first (vs. review-first) sequence reduces cognitive effort and heightens affective heuristics, thereby increasing rating scores. Moreover, service quality moderates this process. When service quality is low, the rating-first (vs. review-first) sequence instead leads to lower ratings. This research reveals the psychological mechanisms through which evaluation order affects consumer ratings via cognitive and affective pathways. It extends theoretical understanding of online rating formation and offers practical implications for optimizing platform interface design to enhance rating authenticity and credibility.

39. 【2602.00007】PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering

链接https://arxiv.org/abs/2602.00007

作者:MinGyu Jeon,SuWan Cho,JaeYoung Shu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, complex question answering, Large Language, Language Models, Knowledge Graphs

备注

点击查看摘要

Abstract:Large Language Models (LLMs) augmented with Knowledge Graphs (KGs) have advanced complex question answering, yet they often remain susceptible to failure when their initial high-level reasoning plan is flawed. This limitation, analogous to cognitive functional fixedness, prevents agents from restructuring their approach, leading them to pursue unworkable solutions. To address this, we propose PPoGA (Predictive Plan-on-Graph with Action), a novel KGQA framework inspired by human cognitive control and problem-solving. PPoGA incorporates a Planner-Executor architecture to separate high-level strategy from low-level execution and leverages a Predictive Processing mechanism to anticipate outcomes. The core innovation of our work is a self-correction mechanism that empowers the agent to perform not only Path Correction for local execution errors but also Plan Correction by identifying, discarding, and reformulating the entire plan when it proves ineffective. We conduct extensive experiments on three challenging multi-hop KGQA benchmarks: GrailQA, CWQ, and WebQSP. The results demonstrate that PPoGA achieves state-of-the-art performance, significantly outperforming existing methods. Our work highlights the critical importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems.

40. 【2602.00006】FDA AI Search: Making FDA-Authorized AI Devices Searchable

链接https://arxiv.org/abs/2602.00006

作者:Arun Kavishwar,William Lotter

类目:Information Retrieval (cs.IR)

关键词:non-searchable summary PDFs, received marketing authorization, AI-enabled medical devices, summary PDFs, FDA databases

备注: Findings paper presented at the 5th Machine Learning for Health (ML4H) Symposium (2025)

点击查看摘要

Abstract:Over 1,200 AI-enabled medical devices have received marketing authorization from the U.S. FDA, yet identifying devices suited to specific clinical needs remains challenging because the FDA's databases contain only limited metadata and non-searchable summary PDFs. To address this gap, we developed FDA AI Search, a website that enables semantic querying of FDA-authorized AI-enabled devices. The backend includes an embedding-based retrieval system, where LLM-extracted features from authorization summaries are compared to user queries to find relevant matches. We present quantitative and qualitative evaluation that support the effectiveness of the retrieval algorithm compared to keyword-based methods. As FDA-authorized AI devices become increasingly prevalent and their use cases expand, we envision that the tool will assist healthcare providers in identifying devices aligned with their clinical needs and support developers in formulating novel AI applications.

41. 【2602.00005】AutoBool: An Reinforcement-Learning trained LLM for Effective Automated Boolean Query Generation for Systematic Reviews

链接https://arxiv.org/abs/2602.00005

作者:Shuai Wang,Harrisen Scells,Bevan Koopman,Guido Zuccon

类目:Information Retrieval (cs.IR)

关键词:medical systematic reviews, trains large language, generate effective Boolean, effective Boolean queries, large language models

备注

点击查看摘要

Abstract:We present AutoBool, a reinforcement learning (RL) framework that trains large language models (LLMs) to generate effective Boolean queries for medical systematic reviews. Boolean queries are the primary mechanism for literature retrieval in this domain and must achieve high recall while maintaining reasonable precision - a challenging balance that existing prompt-based LLM approaches often struggle to achieve. A major limitation in this space is the lack of high-quality ground-truth Boolean queries for each topic, which makes supervised fine-tuning impractical. AutoBool addresses this challenge by using RL to directly optimize query generation with retrieval measures, without requiring target queries. To support this effort, we create and release the largest dataset of its kind: 65588 topics in total for training and evaluating the task of automatic Boolean query formulation. Experiments on our new dataset and two established datasets (CLEF TAR and Seed Collection) show that AutoBool significantly outperforms zero shot/few shot prompting and matches or exceeds the effectiveness of much larger GPT-based models (e.g., GPT-4o, O3) using smaller backbones. It also approaches effectiveness of expert-authored queries while retrieving 10 to 16 times fewer documents. Ablation studies reveal the critical roles of model backbone, size, decoding temperature, and prompt design. Code and data are available at this https URL.

42. 【2602.00004】C$^2$-Cite: Contextual-Aware Citation Generation for Attributed Large Language Models

链接https://arxiv.org/abs/2602.00004

作者:Yue Yu,Ting Bai,HengZhi Lan,Li Qian,Li Peng,Jie Wu,Wei Liu,Jian Luan,Chuan Shi

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)

关键词:attribution technique enhances, enabling users, attribution technique, technique enhances, enhances the credibility

备注: WSDM26

点击查看摘要

Abstract:The attribution technique enhances the credibility of LLMs by adding citations to the generated sentences, enabling users to trace back to the original sources and verify the reliability of the output. However, existing instruction-tuned attributed LLMs often fail to properly interpret the contextual semantics of citation symbols (e.g., [i]) during text generation. This shortcoming arises from their insufficient awareness of the context information surrounding citation markers, which in turn leads to disjointed references and poor integration of retrieved knowledge into the generated content. To address this issue, we propose a novel \textbf{C}ontextual-aware \textbf{C}itation generation framework (\textbf{C$^2$}-\textbf{Cite}) that explicitly integrates the semantic relationships between citation markers and their referenced content. Specifically, a contextual citation alignment mechanism is adopted: it first encodes the retrieved document contexts into the symbol representation of citations, then aligns the marker numbers by decoding information from a citation router function. This mechanism enables the transformation of citation markers from generic placeholders into active knowledge pointers that link to the referenced source information. Experimental results on the ALCE benchmark across three datasets validate our framework C$^2$-Cite++: it outperforms the SOTA baseline by an average of 5.8\% in citation quality and 17.4\% in response correctness. The implementation is publicly available at this https URL

43. 【2602.00003】Efficient Multilingual Search Relevance Modeling in E-Commerce via LLM Mixture-of-Experts

链接https://arxiv.org/abs/2602.00003

作者:Ye Liu,Xu Chen,Wuji Chen,Mang Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:fine-grained semantic nuances, extreme linguistic diversity, search relevance modeling, relevance modeling faces, cross-border e-commerce

备注: 4 pages, 2 figures

点击查看摘要

Abstract:In cross-border e-commerce, search relevance modeling faces the dual challenge of extreme linguistic diversity and fine-grained semantic nuances. Existing approaches typically rely on scaling up a single monolithic Large Language Model (LLM). However, our empirical analysis reveals that single models suffer from uneven capability distributions across regions. For example, excelling in English while underperforming in specific Southeast Asian languages. In this work, we shift the paradigm from scaling a single model to orchestrating heterogeneous experts. We propose a scalable Coarse-grained Mixture-of-Experts (MoE) framework that leverages the inherent complementarity of distinct open-source LLMs (e.g., Qwen, Gemma) without expensive pre-training. Unlike standard token-level MoE, our framework dynamically routes entire queries to specialized experts and, crucially, employs an Information-Preserving Concatenation Fusion strategy. We theoretically posit that preserving the distinct embedding manifolds of heterogeneous experts-rather than compressing them via weighted averaging-is essential for capturing complex relevance signals in a multi-model latent space. On datasets spanning six Southeast Asian markets, our MoE improves AUC by 0.72 percentage points over a dense baseline with the same active parameters. Meanwhile, the optimized pipeline achieves 13.72 queries per second (QPS), a 9% throughput improvement.

44. 【2602.00002】Disentangled Interest Network for Out-of-Distribution CTR Prediction

链接https://arxiv.org/abs/2602.00002

作者:Yu Zheng,Chen Gao,Jianxin Chang,Yanan Niu,Yang Song,Depeng Jin,Meng Wang,Yong Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:online information services, information services, Disentangled Click-Through Rate, estimates the probability, critical task

备注: Accepted by ACM TOIS

点击查看摘要

Abstract:Click-through rate (CTR) prediction, which estimates the probability of a user clicking on a given item, is a critical task for online information services. Existing approaches often make strong assumptions that training and test data come from the same distribution. However, the data distribution varies since user interests are constantly evolving, resulting in the out-of-distribution (OOD) issue. In addition, users tend to have multiple interests, some of which evolve faster than others. Towards this end, we propose Disentangled Click-Through Rate prediction (DiseCTR), which introduces a causal perspective of recommendation and disentangles multiple aspects of user interests to alleviate the OOD issue in recommendation. We conduct a causal factorization of CTR prediction involving user interest, exposure model, and click model, based on which we develop a deep learning implementation for these three causal mechanisms. Specifically, we first design an interest encoder with sparse attention which maps raw features to user interests, and then introduce a weakly supervised interest disentangler to learn independent interest embeddings, which are further integrated by an attentive interest aggregator for prediction. Experimental results on three real-world datasets show that DiseCTR achieves the best accuracy and robustness in OOD recommendation against state-of-the-art approaches, significantly improving AUC and GAUC by over 0.02 and reducing logloss by over 13.7%. Further analyses demonstrate that DiseCTR successfully disentangles user interests, which is the key to OOD generalization for CTR prediction. We have released the code and data at this https URL.

45. 【2507.21934】Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

链接https://arxiv.org/abs/2507.21934

作者:Tianyi Hu,Andrea Morales-Garzón,Jingyi Zheng,Maria Maistro,Daniel Hershcovich

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:original dish essence, ensure cultural appropriateness, provide diverse options, dish essence, Retrieval Augmented Generation

备注

点击查看摘要

Abstract:In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish's essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.

计算机视觉

1. 【2602.02493】PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

链接https://arxiv.org/abs/2602.02493

作者:Zehong Ma,Ruihan Xu,Shiliang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:generates images directly, Pixel diffusion generates, avoiding the artifacts, Pixel diffusion, artifacts and bottlenecks

备注: Project Pages: [this https URL](https://zehong-ma.github.io/PixelGen/)

点击查看摘要

Abstract:Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Codes are publicly available at this https URL.

2. 【2602.02471】Multi-head automated segmentation by incorporating detection head into the contextual layer neural network

链接https://arxiv.org/abs/2602.02471

作者:Edwin Kys,Febian Febian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

关键词:Deep learning based, learning based auto, lacking target structures, Deep learning, produce anatomically implausible

备注: 8 pages, 3 figures, 1 table

点击查看摘要

Abstract:Deep learning based auto segmentation is increasingly used in radiotherapy, but conventional models often produce anatomically implausible false positives, or hallucinations, in slices lacking target structures. We propose a gated multi-head Transformer architecture based on Swin U-Net, augmented with inter-slice context integration and a parallel detection head, which jointly performs slice-level structure detection via a multi-layer perceptron and pixel-level segmentation through a context-enhanced stream. Detection outputs gate the segmentation predictions to suppress false positives in anatomically invalid slices, and training uses slice-wise Tversky loss to address class imbalance. Experiments on the Prostate-Anatomical-Edge-Cases dataset from The Cancer Imaging Archive demonstrate that the gated model substantially outperforms a non-gated segmentation-only baseline, achieving a mean Dice loss of $0.013 \pm 0.036$ versus $0.732 \pm 0.314$, with detection probabilities strongly correlated with anatomical presence, effectively eliminating spurious segmentations. In contrast, the non-gated model exhibited higher variability and persistent false positives across all slices. These results indicate that detection-based gating enhances robustness and anatomical plausibility in automated segmentation applications, reducing hallucinated predictions without compromising segmentation quality in valid slices, and offers a promising approach for improving the reliability of clinical radiotherapy auto-contouring workflows.

3. 【2602.02465】MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

链接https://arxiv.org/abs/2602.02465

作者:Jana Zeller,Thaddäus Wiedemer,Fanfei Li,Thomas Klein,Prasanna Mayilvahanan,Matthias Bethge,Felix Wichmann,Ryan Cotterell,Wieland Brendel

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:multimodal large language, unified multimodal models, ingest visual information, large language models, native interleaved generation

备注: 9 pages, 8 figures

点击查看摘要

Abstract:Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

4. 【2602.02444】RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

链接https://arxiv.org/abs/2602.02444

作者:Tyler Skow,Alexander Martin,Benjamin Van Durme,Rama Chellappa,Reno Kriz

类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

关键词:modern retrieval systems, efficient first-stage retriever, refine results, critical component, component of modern

备注

点击查看摘要

Abstract:Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives, while more efficient.

5. 【2602.02437】UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

链接https://arxiv.org/abs/2602.02437

作者:Dianyi Wang,Chaofan Ma,Feng Han,Size Wu,Wei Song,Yibin Wang,Zhixiong Zhang,Tianhang Wang,Siyuan Wang,Zhongyu Wei,Jiaqi Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Unified multimodal models, interconnected reasoning steps, demand deep reasoning, typically treat, multimodal models

备注

点击查看摘要

Abstract:Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

6. 【2602.02426】SelvaMask: Segmenting Trees in Tropical Forests and Beyond

链接https://arxiv.org/abs/2602.02426

作者:Simon-Olivier Duguay,Hugo Baudchon,Etienne Laliberté,Helene Muller-Landau,Gonzalo Rivas-Torres,Arthur Ouaknine

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:global ecological balance, planet tree biodiversity, ecological balance, critical to global, global ecological

备注: 22 pages, 8 figures

点击查看摘要

Abstract:Tropical forests harbor most of the planet's tree biodiversity and are critical to global ecological balance. Canopy trees in particular play a disproportionate role in carbon storage and functioning of these ecosystems. Studying canopy trees at scale requires accurate delineation of individual tree crowns, typically performed using high-resolution aerial imagery. Despite advances in transformer-based models for individual tree crown segmentation, performance remains low in most forests, especially tropical ones. To this end, we introduce SelvaMask, a new tropical dataset containing over 8,800 manually delineated tree crowns across three Neotropical forest sites in Panama, Brazil, and Ecuador. SelvaMask features comprehensive annotations, including an inter-annotator agreement evaluation, capturing the dense structure of tropical forests and highlighting the difficulty of the task. Leveraging this benchmark, we propose a modular detection-segmentation pipeline that adapts vision foundation models (VFMs), using domain-specific detection-prompter. Our approach reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. We validate these gains on external tropical and temperate datasets, demonstrating that SelvaMask serves as both a challenging benchmark and a key enabler for generalized forest monitoring. Our code and dataset will be released publicly.

7. 【2602.02409】Catalyst: Out-of-Distribution Detection via Elastic Scaling

链接https://arxiv.org/abs/2602.02409

作者:Abid Hassan,Tuan Ngo,Saad Shafiq,Nenad Medvidovic

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep neural networks, neural networks, safe deployment, deployment of deep, deep neural

备注

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is critical for the safe deployment of deep neural networks. State-of-the-art post-hoc methods typically derive OOD scores from the output logits or penultimate feature vector obtained via global average pooling (GAP). We contend that this exclusive reliance on the logit or feature vector discards a rich, complementary signal: the raw channel-wise statistics of the pre-pooling feature map lost in GAP. In this paper, we introduce Catalyst, a post-hoc framework that exploits these under-explored signals. Catalyst computes an input-dependent scaling factor ($\gamma$) on-the-fly from these raw statistics (e.g., mean, standard deviation, and maximum activation). This $\gamma$ is then fused with the existing baseline score, multiplicatively modulating it -- an ``elastic scaling'' -- to push the ID and OOD distributions further apart. We demonstrate Catalyst is a generalizable framework: it seamlessly integrates with logit-based methods (e.g., Energy, ReAct, SCALE) and also provides a significant boost to distance-based detectors like KNN. As a result, Catalyst achieves substantial and consistent performance gains, reducing the average False Positive Rate by 32.87 on CIFAR-10 (ResNet-18), 27.94% on CIFAR-100 (ResNet-18), and 22.25% on ImageNet (ResNet-50). Our results highlight the untapped potential of pre-pooling statistics and demonstrate that Catalyst is complementary to existing OOD detection approaches.

8. 【2602.02408】ReasonEdit: Editing Vision-Language Models using Human Reasoning

链接https://arxiv.org/abs/2602.02408

作者:Jiaxing Qiu,Kaihua Hou,Roxana Daneshjou,Ahmed Alaa,Thomas Hartvigsen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:altering unrelated behaviors, errors in large, unrelated behaviors, aims to correct, correct errors

备注

点击查看摘要

Abstract:Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

9. 【2602.02402】SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation

链接https://arxiv.org/abs/2602.02402

作者:Mu Huang,Hui Wang,Kerui Ren,Linning Xu,Yunsong Zhou,Mulin Yu,Bo Dai,Jiangmiao Pang

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)

关键词:Simulating deformable objects, dynamics jointly driven, Simulating deformable, rich interactions remains, objects under rich

备注: Project page: [this https URL](https://city-super.github.io/SoMA/)

点击查看摘要

Abstract:Simulating deformable objects under rich interactions remains a fundamental challenge for real-to-sim robot manipulation, with dynamics jointly driven by environmental effects and robot actions. Existing simulators rely on predefined physics or data-driven dynamics without robot-conditioned control, limiting accuracy, stability, and generalization. This paper presents SoMA, a 3D Gaussian Splat simulator for soft-body manipulation. SoMA couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim simulation. Modeling interactions over learned Gaussian splats enables controllable, stable long-horizon manipulation and generalization beyond observed trajectories without predefined physical models. SoMA improves resimulation accuracy and generalization on real-world robot manipulation by 20%, enabling stable simulation of complex tasks such as long-horizon cloth folding.

10. 【2602.02401】Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

链接https://arxiv.org/abs/2602.02401

作者:Xinshun Wang,Peiming Li,Ziyi Wang,Zhongbin Fang,Zhichao Deng,Songtao Wu,Jason Li,Mengyuan Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human motion analysis, Human motion, motion, play an essential, computer vision

备注

点击查看摘要

Abstract:Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between ``perception'' models that understand motion from video but only output text, and ``generation'' models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.

11. 【2602.02393】Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

链接https://arxiv.org/abs/2602.02393

作者:Ruiqi Wu,Xuanhua He,Meng Cheng,Tianyu Yang,Yong Zhang,Zhuoliang Kang,Xunliang Cai,Xiaoming Wei,Chunle Guo,Chongyi Li,Ming-Ming Cheng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:complex real-world environments, interactive world model, world model capable, maintaining coherent visual, robust interactive world

备注: 14 pages, 8 figures

点击查看摘要

Abstract:We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.

12. 【2602.02388】Personalized Image Generation via Human-in-the-loop Bayesian Optimization

链接https://arxiv.org/abs/2602.02388

作者:Rajalaxmi Rajagopalan,Debottam Dutta,Yu-Lin Wei,Romit Roy Choudhury

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Imagine Alice, ast, Imagine, Preferential Bayesian Optimization, image

备注

点击查看摘要

Abstract:Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization) that carefully generates $K$ new images as a function of $x^{p*}$, gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.

13. 【2602.02380】Unified Personalized Reward Model for Vision Generation

链接https://arxiv.org/abs/2602.02380

作者:Yibin Wang,Yuhang Zang,Feng Han,Jiazi Bu,Yujie Zhou,Cheng Jin,Jiaqi Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advancements, advancements in multimodal, significantly propelled, propelled the development, multimodal reward models

备注: Website: [this https URL](https://codegoat24.github.io/UnifiedReward/flex)

点击查看摘要

Abstract:Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.

14. 【2602.02370】Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes

链接https://arxiv.org/abs/2602.02370

作者:Uma Meleti,Jeffrey J. Nirschl

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate histopathologic interpretation, current deep learning, Accurate histopathologic, deep learning models, clinical decision-making

备注: Accepted for publication at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Accurate histopathologic interpretation is key for clinical decision-making; however, current deep learning models for digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, which limit trust and clinical adoption. Safety-critical medical imaging workflows benefit from intrinsic uncertainty-aware properties that can accurately reject OOD input. We implement the Spectral-normalized Neural Gaussian Process (SNGP), a set of lightweight modifications that apply spectral normalization and replace the final dense layer with a Gaussian process layer to improve single-model uncertainty estimation and OOD detection. We evaluate SNGP vs. deterministic and MonteCarlo dropout on six datasets across three biomedical classification tasks: white blood cells, amyloid plaques, and colorectal histopathology. SNGP has comparable in-distribution performance while significantly improving uncertainty estimation and OOD detection. Thus, SNGP or related models offer a useful framework for uncertainty-aware classification in digital pathology, supporting safe deployment and building trust with pathologists.

15. 【2602.02356】NAB: Neural Adaptive Binning for Sparse-View CT reconstruction

链接https://arxiv.org/abs/2602.02356

作者:Wangduo Xie,Matthew B. Blaschko

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Computed Tomography, plays a vital, vital role, role in inspecting, inspecting the internal

备注

点击查看摘要

Abstract:Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel \textbf{N}eural \textbf{A}daptive \textbf{B}inning (\textbf{NAB}) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters -- including position, size, steepness, and rotation -- via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two industrial datasets. It also maintains robust on medical datasets when the binning function is extended to more general expression. The code will be made available.

16. 【2602.02354】Implicit neural representation of textures

链接https://arxiv.org/abs/2602.02354

作者:Albert Kwok,Zheyuan Hu,Dounia Hammou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

关键词:Implicit neural representation, Implicit neural, accurate and efficient, neural representation, Implicit

备注: Albert Kwok and Zheyuan Hu contributed equally to this work

点击查看摘要

Abstract:Implicit neural representation (INR) has proven to be accurate and efficient in various domains. In this work, we explore how different neural networks can be designed as a new texture INR, which operates in a continuous manner rather than a discrete one over the input UV coordinate space. Through thorough experiments, we demonstrate that these INRs perform well in terms of image quality, with considerable memory usage and rendering inference time. We analyze the balance between these objectives. In addition, we investigate various related applications in real-time rendering and down-stream tasks, e.g. mipmap fitting and INR-space generation.

17. 【2602.02343】Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

链接https://arxiv.org/abs/2602.02343

作者:Ziwen Xu,Chenyan Wu,Hengyu Sun,Haiwen Hong,Mengru Wang,Yunzhi Yao,Longtao Huang,Hui Xue,Shumin Deng,Zhixuan Chu,Huajun Chen,Ningyu Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:making comparison difficult, controlling large language, including local weight, local weight fine-tuning, large language models

备注: Work in progress

点击查看摘要

Abstract:Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at this https URL.

18. 【2602.02341】LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

链接https://arxiv.org/abs/2602.02341

作者:Zhenpeng Huang,Jiaqi Li,Zihan Jia,Xinhao Li,Desen Meng,Lingxue Song,Xi Chen,Liang Li,Limin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Direct Preference Optimization, two-stage Direct Preference, Preference Optimization framework, enables short-context vision-language, robustly understand ultra-long

备注: NeurIPS 2025

点击查看摘要

Abstract:We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.

19. 【2602.02334】VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations

链接https://arxiv.org/abs/2602.02334

作者:Fatemeh Zargarbashi,Dhruv Agrawal,Jakob Buhmann,Martin Guay,Stelian Coros,Robert W. Sumner

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:subtle stylistic features, Human motion data, rich and complex, Human motion, inherently rich

备注

点击查看摘要

Abstract:Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model. We propose a novel method for effective disentanglement of the style and content in human motion data to facilitate style transfer. Our approach is guided by the insight that content corresponds to coarse motion attributes while style captures the finer, expressive details. To model this hierarchy, we employ Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn a coarse-to-fine representation of motion. We further enhance the disentanglement by integrating contrastive learning and a novel information leakage loss with codebook learning to organize the content and the style across different codebooks. We harness this disentangled representation using our simple and effective inference-time technique Quantized Code Swapping, which enables motion style transfer without requiring any fine-tuning for unseen styles. Our framework demonstrates strong versatility across multiple inference applications, including style transfer, style removal, and motion blending.

20. 【2602.02318】Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation

链接https://arxiv.org/abs/2602.02318

作者:Xiang Li,Yupeng Zheng,Pengfei Li,Yilun Chen,Ya-Qin Zhang,Wenchao Ding

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:faces efficiency-accuracy trade-offs, efficiency-accuracy trade-offs, critical geometric, geometric and semantic, semantic understanding

备注: Accepted by RA-L

点击查看摘要

Abstract:Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods suffer computational waste on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this paper, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, including encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer and (2) a Teacher-Guided Initialization policy, employing optimized parameter warm-up to accelerate model convergence. Validated on the Occ-Scannet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even better than the depth-enhanced version, OPUS†. With depth integration, DiScene† attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62$\times$ faster inference speed. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments. Code and models can be accessed at this https URL.

21. 【2602.02259】Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

链接https://arxiv.org/abs/2602.02259

作者:Hamza Adnan,Matthew T. Jackson,Alexey Zakharov

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling reinforcement learning, extract action-relevant representations, action-relevant representations solely, learn to extract, raw observations

备注

点击查看摘要

Abstract:Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM -- a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on continuous-control MuJoCo tasks, modified with action-correlated background noise. Our approach yields up to a 4x increase in accrued rewards compared to standard baselines and a 3x improvement in the latent action quality, as evidenced by linear probe evaluation.

22. 【2602.02232】LiFlow: Flow Matching for 3D LiDAR Scene Completion

链接https://arxiv.org/abs/2602.02232

作者:Andrea Matteazzi,Dietmar Tutsch

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving scenarios, autonomous driving systems, autonomous driving, driving scenarios, driving systems

备注

点击查看摘要

Abstract:In autonomous driving scenarios, the collected LiDAR point clouds can be challenged by occlusion and long-range sparsity, limiting the perception of autonomous driving systems. Scene completion methods can infer the missing parts of incomplete 3D LiDAR scenes. Recent methods adopt local point-level denoising diffusion probabilistic models, which require predicting Gaussian noise, leading to a mismatch between training and inference initial distributions. This paper introduces the first flow matching framework for 3D LiDAR scene completion, improving upon diffusion-based methods by ensuring consistent initial distributions between training and inference. The model employs a nearest neighbor flow matching loss and a Chamfer distance loss to enhance both local structure and global coverage in the alignment of point clouds. LiFlow achieves state-of-the-art performance across multiple metrics. Code: this https URL.

23. 【2602.02227】Show, Don't Tell: Morphing Latent Reasoning into Image Generation

链接https://arxiv.org/abs/2602.02227

作者:Harold Haodong Chen,Xinxiang Yin,Wen-Jie Shu,Hongfei Zhang,Zixin Zhang,Chenfei Liao,Litao Guo,Qifeng Chen,Ying-Cong Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable progress, remarkable progress, achieved remarkable, existing methods, methods often lack

备注: Code: [this https URL](https://github.com/EnVision-Research/LatentMorph)

点击查看摘要

Abstract:Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation--a hallmark of human creativity. Current reasoning-augmented paradigms most rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by $16\%$ on GenEval and $25\%$ on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by $15\%$ and $11\%$ on abstract reasoning tasks like WISE and IPV-Txt, (III) while reducing inference time by $44\%$ and token consumption by $51\%$; and (IV) exhibits $71\%$ cognitive alignment with human intuition on reasoning invocation.

24. 【2602.02223】Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed, Camera Placement, and Camera Type

链接https://arxiv.org/abs/2602.02223

作者:Junchi Feng,Nikhil Ballem,Mahya Beheshti,Giles Hamilton-Fletcher,Todd Hudson,Maurizio Porfiri,William H. Seiple,John-Ross Rizzo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Optical character recognition, machine-readable form, Optical character, converts printed, printed or handwritten

备注

点击查看摘要

Abstract:Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone's main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. Across devices, the phone's main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.

25. 【2602.02222】MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

链接https://arxiv.org/abs/2602.02222

作者:Ruiqi Liu,Manni Cui,Ziheng Qin,Zhiyuan Yan,Ruoxin Chen,Yi Han,Zhiheng Li,Junkai Chen,ZhiJin Chen,Kaiqing Lin,Jialiang Shen,Lubin Weng,Jing Dong,Yan Wang,Shu Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:High-fidelity generative models, High-fidelity generative, posing serious threats, media security, models have narrowed

备注

点击查看摘要

Abstract:High-fidelity generative models have narrowed the perceptual gap between synthetic and real images, posing serious threats to media security. Most existing AI-generated image (AIGI) detectors rely on artifact-based classification and struggle to generalize to evolving generative traces. In contrast, human judgment relies on stable real-world regularities, with deviations from the human cognitive manifold serving as a more generalizable signal of forgery. Motivated by this insight, we reformulate AIGI detection as a Reference-Comparison problem that verifies consistency with the real-image manifold rather than fitting specific forgery cues. We propose MIRROR (Manifold Ideal Reference ReconstructOR), a framework that explicitly encodes reality priors using a learnable discrete memory bank. MIRROR projects an input into a manifold-consistent ideal reference via sparse linear combination, and uses the resulting residuals as robust detection signals. To evaluate whether detectors reach the "superhuman crossover" required to replace human experts, we introduce the Human-AIGI benchmark, featuring a psychophysically curated human-imperceptible subset. Across 14 benchmarks, MIRROR consistently outperforms prior methods, achieving gains of 2.1% on six standard benchmarks and 8.1% on seven in-the-wild benchmarks. On Human-AIGI, MIRROR reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts, and further approaching the human perceptual limit as pretrained backbones scale. The code is publicly available at: this https URL

26. 【2602.02220】LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation

链接https://arxiv.org/abs/2602.02220

作者:Bo Miao,Weijia Liu,Jun Luo,Lachlan Shinnick,Jian Liu,Thomas Hamilton-Smith,Yuhe Yang,Zijie Wu,Vanja Videnovic,Feras Dayoub,Anton van den Hengel

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:fundamental to meaningful, meaningful communication, communication between humans, interpret natural language, natural language instructions

备注

点击查看摘要

Abstract:The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: this https URL

27. 【2602.02214】Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

链接https://arxiv.org/abs/2602.02214

作者:Hongzhou Zhu,Min Zhao,Guande He,Hang Su,Chongxuan Li,Jun Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:interactive video generation, video diffusion models, real-time interactive video, achieve real-time interactive, pretrained bidirectional video

备注: Project page and the code: \href{ [this https URL](https://thu-ml.github.io/CausalForcing.github.io/) }{ [this https URL](https://thu-ml.github.io/CausalForcing.github.io/) }

点击查看摘要

Abstract:To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3\% in Dynamic Degree, 8.7\% in VisionReward, and 16.7\% in Instruction Following. Project page and the code: \href{this https URL}{this https URL}

28. 【2602.02212】MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models

链接https://arxiv.org/abs/2602.02212

作者:Zheyuan Zhou,Liang Du,Zixun Sun,Xiaoyu Zhou,Ruimin Ye,Qihao Chen,Yinda Chen,Lemiao Qiu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:existing approaches remain, real-time unpredictable interactions, involve real-time unpredictable, approaches remain inefficient, extracting action-critical signals

备注

点击查看摘要

Abstract:Despite significant progress in Visual-Language-Action (VLA), in highly complex and dynamic environments that involve real-time unpredictable interactions (such as 3D open worlds and large-scale PvP games), existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) extracts verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state-of-the-art, which achieves superior decision quality, stronger generalization, and cutting-edge inference efficiency.

29. 【2602.02193】SSI-DM: Singularity Skipping Inversion of Diffusion Models

链接https://arxiv.org/abs/2602.02193

作者:Chen Min,Enze Jiang,Jishen Peng,Zheng Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:early noising steps, Inverting real images, poor editability due, Inverting real, noising steps

备注

点击查看摘要

Abstract:Inverting real images into the noise space is essential for editing tasks using diffusion models, yet existing methods produce non-Gaussian noise with poor editability due to the inaccuracy in early noising steps. We identify the root cause: a mathematical singularity that renders inversion fundamentally ill-posed. We propose Singularity Skipping Inversion of Diffusion Models (SSI-DM), which bypasses this singular region by adding small noise before standard inversion. This simple approach produces inverted noise with natural Gaussian properties while maintaining reconstruction fidelity. As a plug-and-play technique compatible with general diffusion models, our method achieves superior performance on public image datasets for reconstruction and interpolation tasks, providing a principled and efficient solution to diffusion model inversion.

30. 【2602.02186】Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision

链接https://arxiv.org/abs/2602.02186

作者:Ziqiao Weng,Jiancheng Yang,Kangxian Xie,Bo Zhou,Weidong Cai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:images frequently exhibit, substantially degrades downstream, exhibit topological incompleteness, degrades downstream anatomical, frequently exhibit topological

备注: 18 pages, 7 figures

点击查看摘要

Abstract:Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing or explicit graph reasoning, leading to limited efficiency and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over \textit{already} incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward this http URL experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications. Code and data will be available at this https URL.

31. 【2602.02185】Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

链接https://arxiv.org/abs/2602.02185

作者:Yu Zeng,Wenxuan Huang,Zhen Fang,Shuang Chen,Yufan Shen,Yishuo Cai,Xiaoman Wang,Zhenfei Yin,Lin Chen,Zehui Chen,Shiting Huang,Yiming Zhao,Yao Hu,Philip Torr,Wanli Ouyang,Shaosheng Cao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Multimodal Large Language, Large Language, complex visual-textual fact-finding, Language Models

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, overly idealized evaluation scenario: On the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released in this https URL.

32. 【2602.02175】CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization

链接https://arxiv.org/abs/2602.02175

作者:Xinquan Yu,Wei Lu,Xiangyang Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:garnered growing attention, threat of misinformation, growing attention, garnered growing, Refine Patch Selection

备注

点击查看摘要

Abstract:To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Consider that current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches, image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. Followed by the background silencing and spatial contrast constraints to suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization. Followed by the asymmetric sparse and semantic consistency constraints to mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.

33. 【2602.02171】Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks

链接https://arxiv.org/abs/2602.02171

作者:Lu Cao,Xiquan He,Junying Zeng,Chaoyun Mai,Min Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:limited sample size, datasets severely restrict, lung nodule, lung nodule texture, insufficient diversity

备注

点击查看摘要

Abstract:The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing methods generate images with insufficient diversity and controllability, suffering from issues such as monotonous texture features and distorted anatomical structures. Therefore, we propose a two-stage generative adversarial network (TSGAN) to enhance the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN is used to generate semantic segmentation mask images, encoding lung nodules and tissue backgrounds to control the anatomical structure of lung nodule images; The second stage uses the DL-Pix2Pix model to translate the mask map into CT images, employing local importance attention to capture local features, while utilizing dynamic weight multi-head window attention to enhance the modeling capability of lung nodule texture and background. Compared to the original dataset, the accuracy improved by 4.6% and mAP by 4% on the LUNA16 dataset. Experimental results demonstrate that TSGAN can enhance the quality of synthetic images and the performance of detection models.

34. 【2602.02163】Reg4Pru: Regularisation Through Random Token Routing for Token Pruning

链接https://arxiv.org/abs/2602.02163

作者:Julian Wyatt,Ronald Clark,Irina Voiculescu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Transformers are widely, modern vision models, vision models due, size and generalisability, widely adopted

备注: 11 pages, 7 figures

点击查看摘要

Abstract:Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically to the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning with reactivating tokens from preserved representations, but the increased computational efficiency of this method results in decreased stability from the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.

35. 【2602.02156】LoopViT: Scaling Visual ARC with Looped Transformers

链接https://arxiv.org/abs/2602.02156

作者:Wen-Jie Shu,Xuerui Qiu,Rui-Jie Zhu,Harold Haodong Chen,Yexin Liu,Harry Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:leveraged vision transformers, Recent advances, tackle the ARC-AGI, leveraged vision, vision transformers

备注: 8 pages, 11 figures

点击查看摘要

Abstract:Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state ``crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at this https URL.

36. 【2602.02154】Deep learning enables urban change profiling through alignment of historical maps

链接https://arxiv.org/abs/2602.02154

作者:Sidi Wu,Yizi Chen,Maurizio Gribaudi,Konrad Schindler,Clément Mallet,Julien Perret,Lorenz Hurni

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Earth observation technologies, modern Earth observation, Prior to modern, modern Earth, Earth observation

备注: 40 pages

点击查看摘要

Abstract:Prior to modern Earth observation technologies, historical maps provide a unique record of long-term urban transformation and offer a lens on the evolving identity of cities. However, extracting consistent and fine-grained change information from historical map series remains challenging due to spatial misalignment, cartographic variation, and degrading document quality, limiting most analyses to small-scale or qualitative approaches. We propose a fully automated, deep learning-based framework for fine-grained urban change analysis from large collections of historical maps, built on a modular design that integrates dense map alignment, multi-temporal object detection, and change profiling. This framework shifts the analysis of historical maps from ad hoc visual comparison toward systematic, quantitative characterization of urban change. Experiments demonstrate the robust performance of the proposed alignment and object detection methods. Applied to Paris between 1868 and 1937, the framework reveals the spatial and temporal heterogeneity in urban transformation, highlighting its relevance for research in the social sciences and humanities. The modular design of our framework further supports adaptation to diverse cartographic contexts and downstream applications.

37. 【2602.02142】FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation

链接https://arxiv.org/abs/2602.02142

作者:Ruiteng Zhao,Wenshuo Wang,Yicheng Ma,Xiaocong Li,Francis E.H. Tay,Marcelo H. Ang Jr.,Haiyue Zhu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:enables fine-grained perception, Force Distillation Module, Force, crucial modality, enables fine-grained

备注

点击查看摘要

Abstract:Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.

38. 【2602.02130】Eliminating Registration Bias in Synthetic CT Generation: A Physics-Based Simulation Framework

链接https://arxiv.org/abs/2602.02130

作者:Lukas Zimmermann,Michael Rauter,Maximilian Schmid,Dietmar Georg,Barbara Knäusl

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:scans remains unattainable, separately acquired scans, acquired scans remains, Normalized Mutual Information, Supervised synthetic

备注

点击查看摘要

Abstract:Supervised synthetic CT generation from CBCT requires registered training pairs, yet perfect registration between separately acquired scans remains unattainable. This registration bias propagates into trained models and corrupts standard evaluation metrics. This may suggest that superior benchmark performance indicates better reproduction of registration artifacts rather than anatomical fidelity. We propose physics-based CBCT simulation to provide geometrically aligned training pairs by construction, combined with evaluation using geometric alignment metrics against input CBCT rather than biased ground truth. On two independent pelvic datasets, models trained on synthetic data achieved superior geometric alignment (Normalized Mutual Information: 0.31 vs 0.22) despite lower conventional intensity scores. Intensity metrics showed inverted correlations with clinical assessment for deformably registered data, while Normalized Mutual Information consistently predicted observer preference across registration methodologies (rho = 0.31, p 0.001). Clinical observers preferred synthetic-trained outputs in 87% of cases, demonstrating that geometric fidelity, not intensity agreement with biased ground truth, aligns with clinical requirements.

39. 【2602.02124】oxicity Assessment in Preclinical Histopathology via Class-Aware Mahalanobis Distance for Known and Novel Anomalies

链接https://arxiv.org/abs/2602.02124

作者:Olga Graf,Dhrupal Patel,Peter Groß,Charlotte Lempp,Matthias Hein,Fabian Heinemann

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Drug-induced toxicity remains, early clinical trials, Drug-induced toxicity, clinical trials, early clinical

备注

点击查看摘要

Abstract:Drug-induced toxicity remains a leading cause of failure in preclinical development and early clinical trials. Detecting adverse effects at an early stage is critical to reduce attrition and accelerate the development of safe medicines. Histopathological evaluation remains the gold standard for toxicity assessment, but it relies heavily on expert pathologists, creating a bottleneck for large-scale screening. To address this challenge, we introduce an AI-based anomaly detection framework for histopathological whole-slide images (WSIs) in rodent livers from toxicology studies. The system identifies healthy tissue and known pathologies (anomalies) for which training data is available. In addition, it can detect rare pathologies without training data as out-of-distribution (OOD) findings. We generate a novel dataset of pixelwise annotations of healthy tissue and known pathologies and use this data to fine-tune a pre-trained Vision Transformer (DINOv2) via Low-Rank Adaptation (LoRA) in order to do tissue segmentation. Finally, we extract features for OOD detection using the Mahalanobis distance. To better account for class-dependent variability in histological data, we propose the use of class-specific thresholds. We optimize the thresholds using the mean of the false negative and false positive rates, resulting in only 0.16\% of pathological tissue classified as healthy and 0.35\% of healthy tissue classified as pathological. Applied to mouse liver WSIs with known toxicological findings, the framework accurately detects anomalies, including rare OOD morphologies. This work demonstrates the potential of AI-driven histopathology to support preclinical workflows, reduce late-stage failures, and improve efficiency in drug development.

40. 【2602.02123】MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos

链接https://arxiv.org/abs/2602.02123

作者:Yangyi Cao,Yuanhang Li,Lan Chen,Qi Mao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:minute-level video editing, flow-based framework, unique challenges, challenges of minute-level, Velocity Blend rectifies

备注

点击查看摘要

Abstract:We propose MLV-Edit, a training-free, flow-based framework that address the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.

41. 【2602.02114】Enhancing Diffusion-Based Quantitatively Controllable Image Generation via Matrix-Form EDM and Adaptive Vicinal Training

链接https://arxiv.org/abs/2602.02114

作者:Xin Ding,Yun Chen,Sen Zhang,Kao Zhang,Nenglun Chen,Peibei Cao,Yongwei Wang,Fei Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Continuous Conditional Diffusion, continuous regression labels, Conditional Diffusion Model, Continuous Conditional, continuous regression

备注

点击查看摘要

Abstract:Continuous Conditional Diffusion Model (CCDM) is a diffusion-based framework designed to generate high-quality images conditioned on continuous regression labels. Although CCDM has demonstrated clear advantages over prior approaches across a range of datasets, it still exhibits notable limitations and has recently been surpassed by a GAN-based method, namely CcGAN-AVAR. These limitations mainly arise from its reliance on an outdated diffusion framework and its low sampling efficiency due to long sampling trajectories. To address these issues, we propose an improved CCDM framework, termed iCCDM, which incorporates the more advanced \textit{Elucidated Diffusion Model} (EDM) framework with substantial modifications to improve both generation quality and sampling efficiency. Specifically, iCCDM introduces a novel matrix-form EDM formulation together with an adaptive vicinal training strategy. Extensive experiments on four benchmark datasets, spanning image resolutions from $64\times64$ to $256\times256$, demonstrate that iCCDM consistently outperforms existing methods, including state-of-the-art large-scale text-to-image diffusion models (e.g., Stable Diffusion 3, FLUX.1, and Qwen-Image), achieving higher generation quality while significantly reducing sampling cost.

42. 【2602.02110】An Empirical Study of World Model Quantization

链接https://arxiv.org/abs/2602.02110

作者:Zhongqian Fu,Tianyi Zhao,Kai Han,Hang Zhou,Xinghao Chen,Yunhe Wang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:compact latent space, World models, World models learn, environment dynamics, enabling agents

备注

点击查看摘要

Abstract:World models learn an internal representation of environment dynamics, enabling agents to simulate and reason about future states within a compact latent space for tasks such as planning, prediction, and inference. However, running world models rely on hevay computational cost and memory footprint, making model quantization essential for efficient deployment. To date, the effects of post-training quantization (PTQ) on world models remain largely unexamined. In this work, we present a systematic empirical study of world model quantization using DINO-WM as a representative case, evaluating diverse PTQ methods under both weight-only and joint weight-activation settings. We conduct extensive experiments on different visual planning tasks across a wide range of bit-widths, quantization granularities, and planning horizons up to 50 iterations. Our results show that quantization effects in world models extend beyond standard accuracy and bit-width trade-offs: group-wise weight quantization can stabilize low-bit rollouts, activation quantization granularity yields inconsistent benefits, and quantization sensitivity is highly asymmetric between encoder and predictor modules. Moreover, aggressive low-bit quantization significantly degrades the alignment between the planning objective and task success, leading to failures that cannot be remedied by additional optimization. These findings reveal distinct quantization-induced failure modes in world model-based planning and provide practical guidance for deploying quantized world models under strict computational constraints. The code will be available at this https URL.

43. 【2602.02107】acher-Guided Student Self-Knowledge Distillation Using Diffusion Model

链接https://arxiv.org/abs/2602.02107

作者:Yu Wang,Chuanguang Yang,Zhulin An,Weilun Feng,Jiarui Zhao,Chengqing Yu,Libo Huang,Boyu Diao,Yongjun Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:teacher, student, align feature information, loss functions, processing and loss

备注

点击查看摘要

Abstract:Existing Knowledge Distillation (KD) methods often align feature information between teacher and student by exploring meaningful feature processing and loss functions. However, due to the difference in feature distributions between the teacher and student, the student model may learn incompatible information from the teacher. To address this problem, we propose teacher-guided student Diffusion Self-KD, dubbed as DSKD. Instead of the direct teacher-student alignment, we leverage the teacher classifier to guide the sampling process of denoising student features through a light-weight diffusion model. We then propose a novel locality-sensitive hashing (LSH)-guided feature distillation method between the original and denoised student features. The denoised student features encapsulate teacher knowledge and could be regarded as a teacher role. In this way, our DSKD method could eliminate discrepancies in mapping manners and feature distributions between the teacher and student, while learning meaningful knowledge from the teacher. Experiments on visual recognition tasks demonstrate that DSKD significantly outperforms existing KD methods across various models and datasets. Our code is attached in supplementary material.

44. 【2602.02092】FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

链接https://arxiv.org/abs/2602.02092

作者:FSVideo Team,Qingyu Chen,Zhiyuan Fang,Haibin Huang,Xinwei Huang,Tong Jin,Minxuan Lin,Bo Liu,Celong Liu,Chongyang Ma,Xing Mei,Xiaohui Shen,Yaojie Shen,Fuwen Tan,Angtian Wang,Xiao Yang,Yiding Yang,Jiamin Yuan,Lingxi Zhang,Yuxin Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fast speed transformer-based, introduce FSVideo, speed transformer-based, fast speed, few-step DIT upsampler

备注: Project Page: [this https URL](https://kingofprank.github.io/fsvideo/)

点击查看摘要

Abstract:We introduce FSVideo, a fast speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1.) a new video autoencoder with highly-compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio), achieving competitive reconstruction quality; 2.) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT, and 3.) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.

45. 【2602.02089】UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction

链接https://arxiv.org/abs/2602.02089

作者:Changbai Li,Haodong Zhu,Hanlin Chen,Xiuping Liang,Tongfei Chen,Shuwei Shao,Linlin Yang,Huobin Tan,Baochang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, enables high-quality, environments gives rise, rise to critical, Splatting

备注: ICLR 2026

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments gives rise to critical challenges in terms of geometric consistency, memory efficiency, and computational scalability. To address these issues, we present UrbanGS, a scalable reconstruction framework that effectively tackles these challenges for city-scale applications. First, we propose a Depth-Consistent D-Normal Regularization module. Unlike existing approaches that rely solely on monocular normal estimators, which can effectively update rotation parameters yet struggle to update position parameters, our method integrates D-Normal constraints with external depth supervision. This allows for comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence, which effectively resolves the issue of geometric accuracy in complex large-scale scenes. To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, providing a systematic solution for high-fidelity large-scale scene reconstruction.

46. 【2602.02067】Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data

链接https://arxiv.org/abs/2602.02067

作者:Nikola Cenikj,Özgün Turgut,Alexander Müller,Alexander Steger,Jan Kehrer,Marcus Brugger,Daniel Rueckert,Eimo Martens,Philip Müller

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multiple angiography views, single angiography view, cardiovascular disease, diagnosed by analyzing, Coronary artery stenosis

备注

点击查看摘要

Abstract:Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views. Although numerous deep-learning models have been proposed for stenosis detection from a single angiography view, their performance heavily relies on expensive view-level annotations, which are often not readily available in hospital systems. Moreover, these models fail to capture the temporal dynamics and dependencies among multiple views, which are crucial for clinical diagnosis. To address this, we propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification. Trained on a real-world clinical dataset, using patient-level supervision and without any view-level annotations, SegmentMIL jointly predicts the presence of stenosis and localizes the affected anatomical region, distinguishing between the right and left coronary arteries and their respective segments. SegmentMIL obtains high performance on internal and external evaluations and outperforms both view-level models and classical MIL baselines, underscoring its potential as a clinically viable and scalable solution for coronary stenosis diagnosis. Our code is available at this https URL.

47. 【2602.02043】Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models

链接https://arxiv.org/abs/2602.02043

作者:Cristian Sbrolli,Matteo Matteucci,Toshihiko Yamasaki

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Modern Vision-Language Models, Modern Vision-Language, blue sphere, red sphere, red cube

备注

点击查看摘要

Abstract:Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating their compositional failures extend beyond known bag-of-words limitations. we uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (this https URL).

48. 【2602.02033】One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

链接https://arxiv.org/abs/2602.02033

作者:Shuo Lu,Haohan Wang,Wei Feng,Weizhen Wang,Shen Zhang,Yaoyu Li,Ao Ma,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Bing Zhan,Yuan Xu,Huizai Yao,Yongcan Yu,Chenyang Si,Jian Liang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:existing approaches adopt, Click-Through Rate, neglecting preference diversity, Advertising image generation, strategy that optimizes

备注

点击查看摘要

Abstract:Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a ``one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present \textit{One Size, Many Fits} (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state-of-the-art performance in both offline and online settings. Our code and datasets will be released at this https URL.

49. 【2602.02014】Rethinking Genomic Modeling Through Optical Character Recognition

链接https://arxiv.org/abs/2602.02014

作者:Hongxin Xiang,Pengsen Ma,Yunkang Cao,Di Yu,Haowen Chen,Xinyu Yang,Xiangxiang Zeng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:largely adopt large, adopt large language, foundation models largely, models largely adopt, Optical Character Recognition

备注

点击查看摘要

Abstract:Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision--language model with a \emph{visual DNA encoder} and a \emph{document decoder}, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives-reading, region grounding, subsequence retrieval, and masked span completion-thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly $20\times$ fewer effective tokens, and surpasses models with up to $985\times$ more activated parameters while tuning only 256k \emph{trainable} parameters.

50. 【2602.02004】ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning

链接https://arxiv.org/abs/2602.02004

作者:Gongli Xi,Kun Wang,Zeming Gao,Huahui Yi,Haolang Lu,Ye Tian,Wendong Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:explicit long-chain inference, Large multimodal reasoning, Large multimodal, models solve challenging, solve challenging visual

备注: 20 pages, 7 figures

点击查看摘要

Abstract:Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify \emph{reasoning drift}: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question $\rightarrow$ outputs $\rightarrow$ visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, \textbf{without any additional training}, ClueTracer improves all \textbf{reasoning} architectures (including \texttt{R1-OneVision}, \texttt{Ocean-R1}, \texttt{MM-Eureka}, \emph{etc}.) by $\mathbf{1.21\times}$ on reasoning benchmarks. When transferred to \textbf{non-reasoning} settings, it yields a $\mathbf{1.14\times}$ gain.

51. 【2602.02002】UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving

链接https://arxiv.org/abs/2602.02002

作者:Guosheng Zhao,Yaozeng Wang,Xiaofeng Wang,Zheng Zhu,Tingdong Yu,Guan Huang,Yongchen Zai,Ji Jiao,Changliang Xue,Xiaole Wang,Zhen Yang,Futang Zhu,Xingang Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated significant promise, demonstrated significant, significant promise, promise for data, autonomous driving

备注: 16 pages, 7 figures

点击查看摘要

Abstract:World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream

52. 【2602.02000】SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors

链接https://arxiv.org/abs/2602.02000

作者:Bing He,Jingnan Gao,Yunuo Chen,Ning Cao,Gang Chen,Zhengxue Cheng,Li Song,Wenjun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:challenging task due, Gaussian Splatting, sparse images remains, recovering accurate geometry, images remains

备注: ICLR 2026

点击查看摘要

Abstract:Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitive. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitive, which provides stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs. Project page: this https URL

53. 【2602.01991】Leveraging Latent Vector Prediction for Localized Control in Image Generation via Diffusion Models

链接https://arxiv.org/abs/2602.01991

作者:Pablo Domingo-Gregorio,Javier Ruiz-Hidalgo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:textual descriptions, Diffusion models emerged, producing high-quality images, producing high-quality, models emerged

备注

点击查看摘要

Abstract:Diffusion models emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside with text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.

54. 【2602.01984】Enhancing Multi-Image Understanding through Delimiter Token Scaling

链接https://arxiv.org/abs/2602.01984

作者:Minyoung Lee,Yeji Park,Dongjun Hwang,Yejin Kim,Seong Joon Oh,Junsuk Choe

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, Large Vision-Language, achieve strong performance, achieve strong, provided as input

备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.

55. 【2602.01976】FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning

链接https://arxiv.org/abs/2602.01976

作者:Hongwei Yan,Guanglong Sun,Kanglei Zhou,Qian Li,Liyuan Wang,Yi Zhong

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:General continual learning, non-stationary data streams, General continual, clear task boundaries, learn from single-pass

备注: 33 pages. Accepted by ICLR 2026

点击查看摘要

Abstract:General continual learning (GCL) challenges intelligent systems to learn from single-pass, non-stationary data streams without clear task boundaries. While recent advances in continual parameter-efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly's hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain-inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt's superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200, respectively. Our source code is available at this https URL.

56. 【2602.01973】Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

链接https://arxiv.org/abs/2602.01973

作者:Muli Yang,Gabriel James Goenawan,Henan Wang,Huaiyuan Qin,Chenghao Xu,Yanhua Yang,Fen Fang,Ying Sun,Joo-Hwee Lim,Hongyuan Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:exhibit systematic bias, frequently misclassifying fake, balanced datasets, test time, frequently misclassifying

备注: AAAI 2026. Code: [this https URL](https://github.com/muliyangm/AIGI-Det-Calib)

点击查看摘要

Abstract:Despite being trained on balanced datasets, existing AI-generated image detectors often exhibit systematic bias at test time, frequently misclassifying fake images as real. We hypothesize that this behavior stems from distributional shift in fake samples and implicit priors learned during training. Specifically, models tend to overfit to superficial artifacts that do not generalize well across different generation methods, leading to a misaligned decision threshold when faced with test-time distribution shift. To address this, we propose a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. In particular, we introduce a learnable scalar correction to the model's logits, optimized on a small validation set from the target distribution while keeping the backbone frozen. This parametric adjustment compensates for distributional shift in model output, realigning the decision boundary even without requiring ground-truth labels. Experiments on challenging benchmarks show that our approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection in the open world. Code is available at this https URL.

57. 【2602.01954】Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images

链接https://arxiv.org/abs/2602.01954

作者:Shuai Yang,Ziyue Huang,Jiaxin Chen,Qingjie Liu,Yunhong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:pretraining-induced text-visual alignment, sensing commonly relies, remote sensing commonly, inference-time category queries, remote sensing

备注

点击查看摘要

Abstract:Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.

58. 【2602.01951】Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network

链接https://arxiv.org/abs/2602.01951

作者:Shuyang Wu,Yifu Qiu,Ines P. Nearchou,Sandrine Prost,Jonathan A Fallowfield,Hakan Bilen,Timothy J Kendall

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multiple-instance Learning, undertake computational pathology, computational pathology, Multi-scale Pyramidal Network, undertake computational

备注

点击查看摘要

Abstract:Multiple-instance Learning (MIL) is commonly used to undertake computational pathology (CPath) tasks, and the use of multi-scale patches allows diverse features across scales to be learned. Previous studies using multi-scale features in clinical applications rely on multiple inputs across magnifications with late feature fusion, which does not retain the link between features across scales while the inputs are dependent on arbitrary, manufacturer-defined magnifications, being inflexible and computationally expensive. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), which is plug-and-play over attention-based MIL that introduces progressive multi-scale analysis on WSI. Our MSPN consists of (1) grid-based remapping that uses high magnification features to derive coarse features and (2) the coarse guidance network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks using 4 clinically relevant tasks across 3 types of foundation model, as well as the pre-trained MIL framework. We show that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy-to-use.

59. 【2602.01949】Boundary-Constrained Diffusion Models for Floorplan Generation: Balancing Realism and Diversity

链接https://arxiv.org/abs/2602.01949

作者:Leonardo Stoppani,Davide Bacciu,Shahab Mokarizadeh

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:automated floorplan generation, producing highly realistic, Fréchet Inception Distance, Diffusion models, realistic layouts conditioned

备注: Accepted at ESANN 2026

点击查看摘要

Abstract:Diffusion models have become widely popular for automated floorplan generation, producing highly realistic layouts conditioned on user-defined constraints. However, optimizing for perceptual metrics such as the Fréchet Inception Distance (FID) causes limited design diversity. To address this, we propose the Diversity Score (DS), a metric that quantifies layout diversity under fixed constraints. Moreover, to improve geometric consistency, we introduce a Boundary Cross-Attention (BCA) module that enables conditioning on building boundaries. Our experiments show that BCA significantly improves boundary adherence, while prolonged training drives diversity collapse undiagnosed by FID, revealing a critical trade-off between realism and diversity. Out-Of-Distribution evaluations further demonstrate the models' reliance on dataset priors, emphasizing the need for generative systems that explicitly balance fidelity, diversity, and generalization in architectural design tasks.

60. 【2602.01930】LIEREx: Language-Image Embeddings for Robotic Exploration

链接https://arxiv.org/abs/2602.01930

作者:Felix Igelbrink,Lennart Niecksch,Marian Renz,Martin Günther,Martin Atzmueller

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:exploring unmapped areas, finding specific objects, finding specific, unmapped areas, robot to reason

备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in KI - Künstliche Intelligenz, and is available online at [this https URL](https://doi.org/10.1007/s13218-026-00902-6)

点击查看摘要

Abstract:Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric representations but are often constrained by pre-designed symbolic vocabularies. The reliance on fixed object classes makes it impractical to handle out-of-distribution knowledge not defined at design time. Recent advances in Vision-Language Foundation Models, such as CLIP, enable open-set mapping, where objects are encoded as high-dimensional embeddings rather than fixed labels. In LIEREx, we integrate these VLFMs with established 3D Semantic Scene Graphs to enable target-directed exploration by an autonomous agent in partially unknown environments.

61. 【2602.01906】DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification

链接https://arxiv.org/abs/2602.01906

作者:Farhan Ullah,Irfan Ullah,Khalil Khan,Giovanni Pau,JaKeoung Koo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:labeled training samples, challenging task due, limited labeled training, Dynamic Context Attention, high spectral dimensionality

备注

点击查看摘要

Abstract:Hyperspectral image classification (HSIC) is a challenging task due to high spectral dimensionality, complex spectral-spatial correlations, and limited labeled training samples. Although transformer-based models have shown strong potential for HSIC, existing approaches often struggle to achieve sufficient spectral discriminability while maintaining computational efficiency. To address these limitations, we propose a novel DSXFormer, a novel dual-pooling spectral squeeze-expansion transformer with Dynamic Context Attention for HSIC. The proposed DSXFormer introduces a Dual-Pooling Spectral Squeeze-Expansion (DSX) block, which exploits complementary global average and max pooling to adaptively recalibrate spectral feature channels, thereby enhancing spectral discriminability and inter-band dependency modeling. In addition, DSXFormer incorporates a Dynamic Context Attention (DCA) mechanism within a window-based transformer architecture to dynamically capture local spectral-spatial relationships while significantly reducing computational overhead. The joint integration of spectral dual-pooling squeeze-expansion and DCA enables DSXFormer to achieve an effective balance between spectral emphasis and spatial contextual representation. Furthermore, patch extraction, embedding, and patch merging strategies are employed to facilitate efficient multi-scale feature learning. Extensive experiments conducted on four widely used hyperspectral benchmark datasets, including Salinas (SA), Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), demonstrate that DSXFormer consistently outperforms state-of-the-art methods, achieving classification accuracies of 99.95%, 98.91%, 99.85%, and 98.52%, respectively.

62. 【2602.01905】Learning Sparse Visual Representations via Spatial-Semantic Factorization

链接https://arxiv.org/abs/2602.01905

作者:Theodore Zhengde Zhao,Sid Kiblawi,Jianwei Yang,Naoto Usuyama,Reuben Tan,Noel C Codella,Tristan Naumann,Hoifung Poon,Mu Wei

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Self-supervised learning, faces a fundamental, fundamental conflict, understanding and image, SSL

备注

点击查看摘要

Abstract:Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at this https URL.

63. 【2602.01901】Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

链接https://arxiv.org/abs/2602.01901

作者:Jiedong Zhuang,Lu Lu,Ming Dai,Rui Hu,Jian Chen,Qiang Liu,Haoji Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal large language, Multimodal large, large language models, visual tokens, exorbitant inference costs

备注: Accepted by AAAI26

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engenders a substantial computational load and key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation towards the attention mechanism of the model from a new perspective, and discern that attention within more than half of all decode layers are semantic similar. Upon this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.

64. 【2602.01899】Multi-Task Learning for Robot Perception with Imbalanced Data

链接https://arxiv.org/abs/2602.01899

作者:Ozgur Erkent

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-task problem solving, Multi-task problem, limited resource, important feature, problem solving

备注: 16 pages

点击查看摘要

Abstract:Multi-task problem solving has been shown to improve the accuracy of the individual tasks, which is an important feature for robots, as they have a limited resource. However, when the number of labels for each task is not equal, namely imbalanced data exist, a problem may arise due to insufficient number of samples, and labeling is not very easy for mobile robots in every environment. We propose a method that can learn tasks even in the absence of the ground truth labels for some of the tasks. We also provide a detailed analysis of the proposed method. An interesting finding is related to the interaction of the tasks. We show a methodology to find out which tasks can improve the performance of other tasks. We investigate this by training the teacher network with the task outputs such as depth as inputs. We further provide empirical evidence when trained with a small amount of data. We use semantic segmentation and depth estimation tasks on different datasets, NYUDv2 and Cityscapes.

65. 【2602.01881】ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding

链接https://arxiv.org/abs/2602.01881

作者:Ye Chen,Yupeng Zhu,Xiongzhen Zhang,Zhewen Wan,Yingzhe Li,Wenjun Zhang,Bingbing Ni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Prevailing image representation, Gaussian primitives, manual editing effort, heavy manual editing, making fine-grained manipulation

备注

点击查看摘要

Abstract:Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.

66. 【2602.01864】rust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling

链接https://arxiv.org/abs/2602.01864

作者:Yuan Wang,Yuhao Wan,Siming Zheng,Bo Li,Qibin Hou,Peng-Tao Jiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:explored reference-based super-resolution, diffusion-based image restoration, Recent works, reference-based super-resolution, works have explored

备注: 26 pages, 19 figures. Accepted to ICLR 2026

点击查看摘要

Abstract:Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a "Trust but Verify" principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment.

67. 【2602.01854】Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection

链接https://arxiv.org/abs/2602.01854

作者:A S M Sharifuzzaman Sagar,Mohammed Bennamoun,Farid Boussaid,Naeha Sharif,Lian Xu,Shaaban Sahmoud,Ali Kishk

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:contextual claim jointly, claim jointly expressed, deception usually arises, jointly expressed, deepfake detectors

备注

点击查看摘要

Abstract:In multimodal misinformation, deception usually arises not just from pixel-level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image-text pair. Yet most deepfake detectors, engineered to detect pixel-level forgeries, do not account for claim-level meaning, despite their growing integration in automated fact-checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel-level detectors contribute useful signal for verifying image-text claims, or do they instead introduce misleading authenticity priors that undermine evidence-based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state-of-the-art image-only deepfake detectors, (2) an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi-Agent Debate (MAD), and (3) a hybrid fact-checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4, and that incorporating their predictions into fact-checking pipelines consistently reduces performance by 0.04-0.08 F1 due to non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel-level artifact signals do not reliably enhance reasoning over real-world image-text misinformation.

68. 【2602.01851】How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

链接https://arxiv.org/abs/2602.01851

作者:Huanyu Zhang,Xuehai Bai,Chengzu Li,Chen Liang,Haochen Tian,Haodong Li,Ruichuan An,Yifan Zhang,Anna Korhonen,Zhang Zhang,Liang Wang,Tieniu Tan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent generative models, achieved remarkable progress, Recent generative, achieved remarkable, remarkable progress

备注: [this https URL](https://vibe-benchmark.github.io/)

点击查看摘要

Abstract:Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

69. 【2602.01850】WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

链接https://arxiv.org/abs/2602.01850

作者:Pei Li,Jiaxi Yin,Lei Ouyang,Shihan Pan,Ge Wang,Han Ding,Fei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human Activity Recognition, IMU-based Human Activity, Activity Recognition, ubiquitous computing applications, Human Activity

备注: Under Review. 28 pages, 9 figures, 6 tables

点击查看摘要

Abstract:IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.

70. 【2602.01844】CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

链接https://arxiv.org/abs/2602.01844

作者:Yuliang Zhan,Jian Li,Wenbing Huang,Wenbing Huang,Yang Liu,Hao Sun

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:complex dynamic systems, simulating complex dynamic, Deep learning, Cloth Dynamics, demonstrated remarkable capabilities

备注: ICLR 2026

点击查看摘要

Abstract:Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non-linear deformations and severe self-occlusions during grounding, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting in video-to-geometry grounding stage. It jointly considers the absolute and relative position of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at this https URL. Visualization results are available at this https URL}.%\footnote{As in this example.

71. 【2602.01843】SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection

链接https://arxiv.org/abs/2602.01843

作者:Qian Xu,Xi Li,Fei Gao,Jie Guo,Haojuan Yuan,Shuaipeng Fan,Mingjin Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Infrared small target, infrared small targets, surveillance and early-warning, video-mode tracking, crucial for surveillance

备注

点击查看摘要

Abstract:Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, PIFR refines features by approximating rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, PGMA injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and SOTA performance.

72. 【2602.01836】Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery

链接https://arxiv.org/abs/2602.01836

作者:Yin Wu,Daniel Slieter,Carl Esselborn,Ahmed Abouelazm,Tsung Yuan Tseng,J. Marius Zöllner

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deploying ADAS, ADAS and ADS, introduce domain shifts, ADS across countries, degrade perception performance

备注

点击查看摘要

Abstract:Deploying ADAS and ADS across countries remains challenging due to differences in legislation, traffic infrastructure, and visual conventions, which introduce domain shifts that degrade perception performance. Traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative locations. To address this, we propose a street-view-guided data acquisition strategy that leverages publicly available imagery to identify places of interest (POI). Two POI scoring methods are introduced: a KNN-based feature distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. To enable repeatable evaluation, we adopt a collect-detect protocol and construct a co-located dataset by pairing the Zenseact Open Dataset with Mapillary street-view images. Experiments on traffic sign detection, a task particularly sensitive to cross-country variations in sign appearance, show that our approach achieves performance comparable to random sampling while using only half of the target-domain data. We further provide cost estimations for full-country analysis, demonstrating that large-scale street-view processing remains economically feasible. These results highlight the potential of street-view-guided data acquisition for efficient and cost-effective cross-country model adaptation.

73. 【2602.01816】Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies

链接https://arxiv.org/abs/2602.01816

作者:Wenjin Hou,Wei Liu,Han Hu,Xiaoxiao Sun,Serena Yeung-Levy,Hehe Fan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, shown remarkable proficiency

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding ``brittle mirages'' where the model's logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.

74. 【2602.01814】GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation

链接https://arxiv.org/abs/2602.01814

作者:Xiao Liang,Yunzhu Zhang,Linchao Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable success, denoising process remains, video generation, high computational cost, major bottleneck

备注

点击查看摘要

Abstract:Diffusion models have achieved remarkable success in video generation; however, the high computational cost of the denoising process remains a major bottleneck. Existing approaches have shown promise in reducing the number of diffusion steps, but they often suffer from significant quality degradation when applied to video generation. We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD introduces a novel training strategy in which a teacher model progressively guides a student model to operate with larger step sizes. The framework consists of two key components: (1) an online-generated training target that reduces optimization difficulty while improving computational efficiency, and (2) frequency-domain constraints in the latent space that promote the preservation of fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Compared with existing distillation methods, GPD demonstrates clear advantages in both pipeline simplicity and quality preservation.

75. 【2602.01812】LDRNet: Large Deformation Registration Model for Chest CT Registration

链接https://arxiv.org/abs/2602.01812

作者:Cheng Wang,Qiyu Gao,Fandong Zhang,Shu Zhang,Yizhou Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:based medical image, http URL, learning based medical, deep learning based, registration algorithms focus

备注

点击查看摘要

Abstract:Most of the deep learning based medical image registration algorithms focus on brain image registration this http URL with brain registration, the chest CT registration has larger deformation, more complex background and region over-lap. In this paper, we propose a fast unsupervised deep learning method, LDRNet, for large deformation image registration of chest CT images. We first predict a coarse resolution registration field, then refine it from coarse to fine. We propose two innovative technical components: 1) a refine block that is used to refine the registration field in different resolutions, 2) a rigid block that is used to learn transformation matrix from high-level features. We train and evaluate our model on the private dataset and public dataset SegTHOR. We compare our performance with state-of-the-art traditional registration methods as well as deep learning registration models VoxelMorph, RCN, and LapIRN. The results demonstrate that our model achieves state-of-the-art performance for large deformation images registration and is much faster.

76. 【2602.01805】FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing

链接https://arxiv.org/abs/2602.01805

作者:Menglin Han,Zhangkai Ni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:attracted increasing attention, Training-free image editing, training data, Training-free image, attracted increasing

备注

点击查看摘要

Abstract:Training-free image editing has attracted increasing attention for its efficiency and independence from training data. However, existing approaches predominantly rely on inversion-reconstruction trajectories, which impose an inherent trade-off: longer trajectories accumulate errors and compromise fidelity, while shorter ones fail to ensure sufficient alignment with the edit prompt. Previous attempts to address this issue typically employ backbone-specific feature manipulations, limiting general applicability. To address these challenges, we propose FlowBypass, a novel and analytical framework grounded in Rectified Flow that constructs a bypass directly connecting inversion and reconstruction trajectories, thereby mitigating error accumulation without relying on feature manipulations. We provide a formal derivation of two trajectories, from which we obtain an approximate bypass formulation and its numerical solution, enabling seamless trajectory transitions. Extensive experiments demonstrate that FlowBypass consistently outperforms state-of-the-art image editing methods, achieving stronger prompt alignment while preserving high-fidelity details in irrelevant regions.

77. 【2602.01801】Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

链接https://arxiv.org/abs/2602.01801

作者:Dvir Samuel,Issar Tzachor,Matan Levy,Micahel Green,Gal Chechik,Rami Ben-Ari

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:neural game engines, interactive neural game, enable streaming generation, models enable streaming, Autoregressive video diffusion

备注: Project Page: [this https URL](https://dvirsamuel.github.io/fast-auto-regressive-video/)

点击查看摘要

Abstract:Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to x5--x10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

78. 【2602.01799】Spatio-Temporal Transformers for Long-Term NDVI Forecasting

链接https://arxiv.org/abs/2602.01799

作者:Ido Faran,Nathan S. Netanyahu,Maxim Shoshany

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Long-term satellite image, faces significant challenges, Long Term Forecasting, Long-term satellite, image time series

备注

点击查看摘要

Abstract:Long-term satellite image time series (SITS) analysis in heterogeneous landscapes faces significant challenges, particularly in Mediterranean regions where complex spatial patterns, seasonal variations, and multi-decade environmental changes interact across different scales. This paper presents the Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF ), an extended framework that advances beyond purely temporal analysis to integrate spatial context modeling with temporal sequence prediction. STT-LTF processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through a unified transformer architecture, capturing both local neighborhood relationships and regional climate influences. The framework employs comprehensive self-supervised learning with spatial masking, temporal masking, and horizon sampling strategies, enabling robust model training from 40 years of unlabeled Landsat imagery. Unlike autoregressive approaches, STT-LTF directly predicts arbitrary future time points without error accumulation, incorporating spatial patch embeddings, cyclical temporal encoding, and geographic coordinates to learn complex dependencies across heterogeneous Mediterranean ecosystems. Experimental evaluation on Landsat data (1984-2024) demonstrates that STT-LTF achieves a Mean Absolute Error (MAE) of 0.0328 and R^2 of 0.8412 for next-year predictions, outperforming traditional statistical methods, CNN-based approaches, LSTM networks, and standard transformers. The framework's ability to handle irregular temporal sampling and variable prediction horizons makes it particularly suitable for analysis of heterogeneous landscapes experiencing rapid ecological transitions.

79. 【2602.01783】Automated Discontinuity Set Characterisation in Enclosed Rock Face Point Clouds Using Single-Shot Filtering and Cyclic Orientation Transformation

链接https://arxiv.org/abs/2602.01783

作者:Dibyayan Patra,Pasindu Ranasinghe,Bikram Banerjee,Simit Raval

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:assessing rock-mass stability, exposed rock faces, rock faces, excavation safety, structural discontinuity sets

备注

点击查看摘要

Abstract:Characterisation of structural discontinuity sets in exposed rock faces of underground mine cavities is essential for assessing rock-mass stability, excavation safety, and operational efficiency. UAV and other mobile laser-scanning techniques provide efficient means of collecting point clouds from rock faces. However, the development of a robust and efficient approach for automatic characterisation of discontinuity sets in real-world scenarios, like fully enclosed rock faces in cavities, remains an open research problem. In this study, a new approach is proposed for automatic discontinuity set characterisation that uses a single-shot filtering strategy, an innovative cyclic orientation transformation scheme and a hierarchical clustering technique. The single-shot filtering step isolates planar regions while robustly suppressing noise and high-curvature artefacts in one pass using a signal-processing technique. To address the limitations of Cartesian clustering on polar orientation data, a cyclic orientation transformation scheme is developed, enabling accurate representation of dip angle and dip direction in Cartesian space. The transformed orientations are then characterised into sets using a hierarchical clustering technique, which handles varying density distributions and identifies clusters without requiring user-defined set numbers. The accuracy of the method is validated on real-world mine stope and against ground truth obtained using manually handpicked discontinuity planes identified with the Virtual Compass tool, as well as widely used automated structure mapping techniques. The proposed approach outperforms the other techniques by exhibiting the lowest mean absolute error in estimating discontinuity set orientations in real-world stope data with errors of 1.95° and 2.20° in nominal dip angle and dip direction, respectively, and dispersion errors lying below 3°.

80. 【2602.01780】DDP-WM: Disentangled Dynamics Prediction for Efficient World Models

链接https://arxiv.org/abs/2602.01780

作者:Shicheng Yin,Kaixuan Yin,Weixing Chen,Yang Liu,Guanbin Li,Liang Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:autonomous robotic planning, robotic planning, essential for autonomous, autonomous robotic, Disentangled Dynamics Prediction

备注: Codes will be available at [this https URL](https://github.com/HCPLabSYSU/DDP-WM)

点击查看摘要

Abstract:World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformerbased models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a crossattention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Codes will be available at this https URL.

81. 【2602.01764】GDPR-Compliant Person Recognition in Industrial Environments Using MEMS-LiDAR and Hybrid Data

链接https://arxiv.org/abs/2602.01764

作者:Dennis Basile,Dennis Sprute,Helene Dörksen,Holger Flatt

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:avoid plant shutdowns, safety-critical industrial indoor, industrial indoor spaces, Data Protection Regulation, General Data Protection

备注: Accepted at 19th CIRP Conference on Intelligent Computation in Manufacturing Engineering

点击查看摘要

Abstract:The reliable detection of unauthorized individuals in safety-critical industrial indoor spaces is crucial to avoid plant shutdowns, property damage, and personal hazards. Conventional vision-based methods that use deep-learning approaches for person recognition provide image information but are sensitive to lighting and visibility conditions and often violate privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union. Typically, detection systems based on deep learning require annotated data for training. Collecting and annotating such data, however, is highly time-consuming and due to manual treatments not necessarily error free. Therefore, this paper presents a privacy-compliant approach based on Micro-Electro-Mechanical Systems LiDAR (MEMS-LiDAR), which exclusively captures anonymized 3D point clouds and avoids personal identification features. To compensate for the large amount of time required to record real LiDAR data and for post-processing and annotation, real recordings are augmented with synthetically generated scenes from the CARLA simulation framework. The results demonstrate that the hybrid data improves the average precision by 44 percentage points compared to a model trained exclusively with real data while reducing the manual annotation effort by 50 %. Thus, the proposed approach provides a scalable, cost-efficient alternative to purely real-data-based methods and systematically shows how synthetic LiDAR data can combine high performance in person detection with GDPR compliance in an industrial environment.

82. 【2602.01760】MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

链接https://arxiv.org/abs/2602.01760

作者:Hao Zhang,Yanping Zha,Zizhuo Li,Meiqi Gong,Jiayi Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly practical scenario, visible imaging sensors, practical scenario, paper focuses, highly practical

备注

点击查看摘要

Abstract:This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.

83. 【2602.01756】Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

链接https://arxiv.org/abs/2602.01756

作者:Jun He,Junyan Ye,Zilong Huang,Dongzhi Jiang,Chenjue Zhang,Leqi Zhu,Renrui Zhang,Xiang Zhang,Weijia Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved unprecedented fidelity, existing models function, models function fundamentally, unprecedented fidelity, achieved unprecedented

备注: 36 pages, 24 figures

点击查看摘要

Abstract:While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.

84. 【2602.01754】Spot-Wise Smart Parking: An Edge-Enabled Architecture with YOLOv11 and Digital Twin Integration

链接https://arxiv.org/abs/2602.01754

作者:Gustavo P. C. P. da Luz,Alvaro M. Aspilcueta Narvaez,Tiago Godoi Bannwart,Gabriel Massuyoshi Sato,Luis Fernando Gomez Gonzalez,Juliana Freitag Borin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:smart city adoption, minimize users' search, Smart parking systems, enhancing urban mobility, users' search time

备注: Submitted to Journal of Internet Services and Applications, 27 pages, 20 figures, 3 tables

点击查看摘要

Abstract:Smart parking systems help reduce congestion and minimize users' search time, thereby contributing to smart city adoption and enhancing urban mobility. In previous works, we presented a system developed on a university campus to monitor parking availability by estimating the number of free spaces from vehicle counts within a region of interest. Although this approach achieved good accuracy, it restricted the system's ability to provide spot-level insights and support more advanced applications. To overcome this limitation, we extend the system with a spot-wise monitoring strategy based on a distance-aware matching method with spatial tolerance, enhanced through an Adaptive Bounding Box Partitioning method for challenging spaces. The proposed approach achieves a balanced accuracy of 98.80% while maintaining an inference time of 8 seconds on a resource-constrained edge device, enhancing the capabilities of YOLOv11m, a model that has a size of 40.5 MB. In addition, two new components were introduced: (i) a Digital Shadow that visually represents parking lot entities as a base to evolve to a full Digital Twin, and (ii) an application support server based on a repurposed TV box. The latter not only enables scalable communication among cloud services, the parking totem, and a bot that provides detailed spot occupancy statistics, but also promotes hardware reuse as a step towards greater sustainability.

85. 【2602.01753】ObjEmbed: Towards Universal Multimodal Object Embeddings

链接https://arxiv.org/abs/2602.01753

作者:Shenghao Fu,Yukun Su,Fengyun Rao,Jing Lyu,Xiaohua Xie,Wei-Shi Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Aligning objects, textual descriptions, fundamental challenge, realistic requirement, requirement in vision-language

备注

点击查看摘要

Abstract:Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.

86. 【2602.01741】ail-Aware Post-Training Quantization for 3D Geometry Models

链接https://arxiv.org/abs/2602.01741

作者:Sicheng Pan,Chen Tang,Shuzhao Xie,Ke Yang,Weixiang Zhang,Jiawei Li,Bin Chen,Shu-Tao Xia,Zhi Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:geometry models pose, models pose significant, pose significant challenges, resource-constrained platforms, pose significant

备注

点击查看摘要

Abstract:The burgeoning complexity and scale of 3D geometry models pose significant challenges for deployment on resource-constrained platforms. While Post-Training Quantization (PTQ) enables efficient inference without retraining, conventional methods, primarily optimized for 2D Vision Transformers, fail to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. To address these challenges, we propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline specifically engineered for 3D geometric learning. Our contribution is threefold: (1) To overcome the data-scale bottleneck in 3D datasets, we develop a progressive coarse-to-fine calibration construction strategy that constructs a highly compact subset to achieve both statistical purity and geometric representativeness. (2) We reformulate the quantization interval search as an optimization problem and introduce a ternary-search-based solver, reducing the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(\log N)$ for accelerated deployment. (3) To mitigate quantization error accumulation, we propose TRE-Guided Module-wise Compensation, which utilizes a Tail Relative Error (TRE) metric to adaptively identify and rectify distortions in modules sensitive to long-tailed activation outliers. Extensive experiments on the VGGT and Pi3 benchmarks demonstrate that TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time. The code will be released soon.

87. 【2602.01740】MACD: Model-Aware Contrastive Decoding via Counterfactual Data

链接https://arxiv.org/abs/2602.01740

作者:Qixin Xiao,Kun Zhou

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Video language models, Video language, evidence is weak, plausible but ungrounded, ungrounded content

备注

点击查看摘要

Abstract:Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such a way is hard to control the visual cues that drive hallucination or well align with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.

88. 【2602.01738】Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

链接https://arxiv.org/abs/2602.01738

作者:Yue Zhou,Xinan He,Kaiqing Lin,Bing Fan,Feng Ding,Bin Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:AI-Generated Images, achieve near-perfect accuracy, dramatic performance collapse, achieve near-perfect, collapse in realistic

备注

点击查看摘要

Abstract:While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models , including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30\%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.

89. 【2602.01724】DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation

链接https://arxiv.org/abs/2602.01724

作者:Tushar Anand,Maheswar Bora,Antitza Dantcheva,Abhijit Das

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Mamba block DenVisCoM, architecture specifically tailored, Mamba block, tailored for accurate, hybrid architecture specifically

备注: IEEE International Conference on Robotics and Automation 2026

点击查看摘要

Abstract:In this work, we propose a novel Mamba block DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate and real-time estimation of optical flow and disparity estimation. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively analyze the benchmark trade-off of accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity estimation in real time. All models and associated code are available at this https URL.

90. 【2602.01723】FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization

链接https://arxiv.org/abs/2602.01723

作者:Yikun Ma,Yiqing Li,Jingwen Ye,Zhongkai Wu,Weidong Zhang,Lin Gao,Zhi Jin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, simulation remains challenging, remains challenging, Extending, Gaussian

备注

点击查看摘要

Abstract:Extending 3D Gaussian Splatting (3DGS) to 4D physical simulation remains challenging. Based on the Material Point Method (MPM), existing methods either rely on manual parameter tuning or distill dynamics from video diffusion models, limiting the generalization and optimization efficiency. Recent attempts using LLMs/VLMs suffer from a text/image-to-3D perceptual gap, yielding unstable physics behavior. In addition, they often ignore the surface structure of 3DGS, leading to implausible motion. We propose FastPhysGS, a fast and robust framework for physics-based dynamic 3DGS simulation:(1) Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) to efficiently populate interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy that rapidly optimizes material parameters predicted from a VLM. Experiments show FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB runtime memory, outperforming prior works with broad potential applications.

91. 【2602.01710】Physics Informed Generative AI Enabling Labour Free Segmentation For Microscopy Analysis

链接https://arxiv.org/abs/2602.01710

作者:Salma Zahran,Zhou Ao,Zhengyang Zhang,Chen Chi,Chenchen Yuan,Yanming Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

关键词:high-throughput materials characterisation, Semantic segmentation, prohibitive cost, critical task, task for high-throughput

备注

点击查看摘要

Abstract:Semantic segmentation of microscopy images is a critical task for high-throughput materials characterisation, yet its automation is severely constrained by the prohibitive cost, subjectivity, and scarcity of expert-annotated data. While physics-based simulations offer a scalable alternative to manual labelling, models trained on such data historically fail to generalise due to a significant domain gap, lacking the complex textures, noise patterns, and imaging artefacts inherent to experimental data. This paper introduces a novel framework for labour-free segmentation that successfully bridges this simulation-to-reality gap. Our pipeline leverages phase-field simulations to generate an abundant source of microstructural morphologies with perfect, intrinsically-derived ground-truth masks. We then employ a Cycle-Consistent Generative Adversarial Network (CycleGAN) for unpaired image-to-image translation, transforming the clean simulations into a large-scale dataset of high-fidelity, realistic SEM images. A U-Net model, trained exclusively on this synthetic data, demonstrated remarkable generalisation when deployed on unseen experimental images, achieving a mean Boundary F1-Score of 0.90 and an Intersection over Union (IOU) of 0.88. Comprehensive validation using t-SNE feature-space projection and Shannon entropy analysis confirms that our synthetic images are statistically and featurally indistinguishable from the real data manifold. By completely decoupling model training from manual annotation, our generative framework transforms a data-scarce problem into one of data abundance, providing a robust and fully automated solution to accelerate materials discovery and analysis.

92. 【2602.01696】Cross-Modal Alignment and Fusion for RGB-D Transmission-Line Defect Detection

链接https://arxiv.org/abs/2602.01696

作者:Jiaming Cui,Shuai Zhou,Wenqiang Li,Ruifeng Qin,Feng Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:automated UAV inspection, UAV inspection due, Transmission line defect, detection remains challenging, line defect detection

备注

点击查看摘要

Abstract:Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.

93. 【2602.01683】FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding

链接https://arxiv.org/abs/2602.01683

作者:Kangcong Li,Peng Ye,Lin Zhang,Chao Wang,Huafeng Qin,Tao Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Transitioning Multimodal Large, Multimodal Large Language, Transitioning Multimodal, Language Models

备注

点击查看摘要

Abstract:Transitioning Multimodal Large Language Models (MLLMs) from offline to online streaming video understanding is essential for continuous perception. However, existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation. To resolve this, we propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain's logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into representative frequency coefficients, complemented by residual details to reconstruct a global historical "gist"; and Space Thumbnail Memory (STM), which discretizes the continuous stream into episodic clusters by employing an adaptive compression strategy to distill them into high-density space thumbnails. Extensive experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively. As a training-free solution, FreshMem outperforms several fully fine-tuned methods, offering a highly efficient paradigm for long-horizon streaming video understanding.

94. 【2602.01679】owards Autonomous Instrument Tray Assembly for Sterile Processing Applications

链接https://arxiv.org/abs/2602.01679

作者:Raghavasimhan Sankaranarayanan,Paul Stuart,Nicholas Ahn,Arno Sungarian,Yash Chitalia

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:assembling surgical instruments, department is responsible, responsible for cleaning, surgical instruments, Processing and Distribution

备注: 7 pages, 9 figures, 2026 International Symposium on Medical Robotics

点击查看摘要

Abstract:The Sterile Processing and Distribution (SPD) department is responsible for cleaning, disinfecting, inspecting, and assembling surgical instruments between surgeries. Manual inspection and preparation of instrument trays is a time-consuming, error-prone task, often prone to contamination and instrument breakage. In this work, we present a fully automated robotic system that sorts and structurally packs surgical instruments into sterile trays, focusing on automation of the SPD assembly stage. A custom dataset comprising 31 surgical instruments and 6,975 annotated images was collected to train a hybrid perception pipeline using YOLO12 for detection and a cascaded ResNet-based model for fine-grained classification. The system integrates a calibrated vision module, a 6-DOF Staubli TX2-60L robotic arm with a custom dual electromagnetic gripper, and a rule-based packing algorithm that reduces instrument collisions during transport. The packing framework uses 3D printed dividers and holders to physically isolate instruments, reducing collision and friction during transport. Experimental evaluations show high perception accuracy and statistically significant reduction in tool-to-tool collisions compared to human-assembled trays. This work serves as the scalable first step toward automating SPD workflows, improving safety, and consistency of surgical preparation while reducing SPD processing times.

95. 【2602.01677】SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking

链接https://arxiv.org/abs/2602.01677

作者:Yinchao Ma,Dengqing Yang,Zhangyu He,Wenfei Yang,Tianzhu Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:video sequence, dynamic scenarios, Visual tracking aims, aims to automatically, automatically estimate

备注: This paper is accepted by IEEE TIP

点击查看摘要

Abstract:Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which releases computational costs of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.

96. 【2602.01674】VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR

链接https://arxiv.org/abs/2602.01674

作者:Hail Song,Boram Yoon,Seokhwan Yang,Seoyoung Kang,Hyunjeong Kim,Henning Metzmacher,Woontack Woo

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Gaussian Splatting, enables real-time full-body, tracking signals, enables real-time, virtual reality

备注: Accepted as an IEEE TVCG paper at IEEE VR 2026 (journal track)

点击查看摘要

Abstract:We present VRGaussianAvatar, an integrated system that enables real-time full-body 3D Gaussian Splatting (3DGS) avatars in virtual reality using only head-mounted display (HMD) tracking signals. The system adopts a parallel pipeline with a VR Frontend and a GA Backend. The VR Frontend uses inverse kinematics to estimate full-body pose and streams the resulting pose along with stereo camera parameters to the backend. The GA Backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve stereo rendering efficiency, we introduce Binocular Batching, which jointly processes left and right eye views in a single batched pass to reduce redundant computation and support high-resolution VR displays. We evaluate VRGaussianAvatar with quantitative performance tests and a within-subject user study against image- and video-based mesh avatar baselines. Results show that VRGaussianAvatar sustains interactive VR performance and yields higher perceived appearance similarity, embodiment, and plausibility. Project page and source code are available at this https URL.

97. 【2602.01673】Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss

链接https://arxiv.org/abs/2602.01673

作者:Enguang Fan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:Loop closure detection, correct accumulated drift, enables pose-graph constraints, Loop closure, identifies revisited places

备注

点击查看摘要

Abstract:Loop closure detection (LCD) is a core component of simultaneous localization and mapping (SLAM): it identifies revisited places and enables pose-graph constraints that correct accumulated drift. Classic bag-of-words approaches such as DBoW are efficient but often degrade under appearance change and perceptual aliasing. In parallel, deep learning-based visual place recognition (VPR) descriptors (e.g., NetVLAD and Transformer-based models) offer stronger robustness, but their computational cost is often viewed as a barrier to real-time SLAM. In this paper, we empirically evaluate NetVLAD as an LCD module and compare it against DBoW on the KITTI dataset. We introduce a Fine-Grained Top-K precision-recall curve that better reflects LCD settings where a query may have zero or multiple valid matches. With Faiss-accelerated nearestneighbor search, NetVLAD achieves real-time query speed while improving accuracy and robustness over DBoW, making it a practical drop-in alternative for LCD in SLAM.

98. 【2602.01666】Moonworks Lunara Aesthetic II: An Image Variation Dataset

链接https://arxiv.org/abs/2602.01666

作者:Yan Wang,Partho Hassan,Samiha Sadeka,Nada Soliman,M M Sayeef Abdullah,Sabit Hassan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:support controlled evaluation, ethically sourced image, introduce Lunara Aesthetic, Lunara Aesthetic, ethically sourced

备注

点击查看摘要

Abstract:We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood; while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara's signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: this https URL.

99. 【2602.01661】From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction

链接https://arxiv.org/abs/2602.01661

作者:Xingyu Miao,Junting Dong,Qin Zhao,Yuhang Yang,Junhao Chen,Yang Long

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:temporally consistent human-centric, consistent human-centric dense, human-centric dense prediction, challenge of temporally, temporally consistent

备注

点击查看摘要

Abstract:In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static data synthetic pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.

100. 【2602.01649】Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

链接https://arxiv.org/abs/2602.01649

作者:Yinchao Ma,Qiang Zhou,Zhibin Wang,Xianing Chen,Hanqing Yang,Jun Song,Bo Zheng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:large language models, demonstrated remarkable capabilities, Video large language, large language, language models

备注: This paper is accepted by AAAI2026

点击查看摘要

Abstract:Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms are proposed to prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel \textbf{C}ontribution-\textbf{a}ware token \textbf{Co}mpression algorithm for \textbf{VID}eo understanding (\textbf{CaCoVID}) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Secondly, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Codes will be released.

101. 【2602.01644】From Perception to Action: Spatial AI Agents and World Models

链接https://arxiv.org/abs/2602.01644

作者:Gloria Felicia,Nolan Bryant,Handi Putra,Ayaan Gazali,Eliel Lobo,Esteban Rojas

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)

关键词:large language models, large language, readily translate, Spatial, prevailing approach

备注: 61 pages, 742 citations, 1 figure, 3 tables. Survey paper on spatial AI agents, embodied AI, graph neural networks, and world models

点击查看摘要

Abstract:While large language models have become the prevailing approach for agentic reasoning and planning, their success in symbolic domains does not readily translate to the physical world. Spatial intelligence, the ability to perceive 3D structure, reason about object relationships, and act under physical constraints, is an orthogonal capability that proves important for embodied agents. Existing surveys address either agentic architectures or spatial domains in isolation. None provide a unified framework connecting these complementary capabilities. This paper bridges that gap. Through a thorough review of over 2,000 papers, citing 742 works from top-tier venues, we introduce a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales. Crucially, we distinguish spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text), arguing that perception alone does not confer agency. Our analysis reveals three key findings mapped to these axes: (1) hierarchical memory systems (Capability axis) are important for long-horizon spatial tasks. (2) GNN-LLM integration (Task axis) is a promising approach for structured spatial reasoning. (3) World models (Scale axis) are essential for safe deployment across micro-to-macro spatial scales. We conclude by identifying six grand challenges and outlining directions for future research, including the need for unified evaluation frameworks to standardize cross-domain assessment. This taxonomy provides a foundation for unifying fragmented research efforts and enabling the next generation of spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.

102. 【2602.01639】ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

链接https://arxiv.org/abs/2602.01639

作者:Tianyu Yang,ChenWei He,Xiangzhao Hao,Tianyue Wang,Jiarui Guo,Haiyun Guo,Leigang Qu,Jinqiao Wang,Tat-Seng Chua

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Composed Image Retrieval, retrieve target images, target images based, hybrid query comprising, Composed Image

备注

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.

103. 【2602.01633】Federated Vision Transformer with Adaptive Focal Loss for Medical Image Classification

链接https://arxiv.org/abs/2602.01633

作者:Xinyuan Zhao,Yihang Wu,Ahmad Chaddad,Tareef Daqqaq,Reem Kateb

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved significant advances, typically require large, Vision Transformer, require large datasets, significant advances

备注: Accepted in Knowledge-Based Systems

点击查看摘要

Abstract:While deep learning models like Vision Transformer (ViT) have achieved significant advances, they typically require large datasets. With data privacy regulations, access to many original datasets is restricted, especially medical images. Federated learning (FL) addresses this challenge by enabling global model aggregation without data exchange. However, the heterogeneity of the data and the class imbalance that exist in local clients pose challenges for the generalization of the model. This study proposes a FL framework leveraging a dynamic adaptive focal loss (DAFL) and a client-aware aggregation strategy for local training. Specifically, we design a dynamic class imbalance coefficient that adjusts based on each client's sample distribution and class data distribution, ensuring minority classes receive sufficient attention and preventing sparse data from being ignored. To address client heterogeneity, a weighted aggregation strategy is adopted, which adapts to data size and characteristics to better capture inter-client variations. The classification results on three public datasets (ISIC, Ocular Disease and RSNA-ICH) show that the proposed framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements ranging from 0.98\% to 41.69\%. Ablation studies on the imbalanced ISIC dataset validate the effectiveness of the proposed loss function and aggregation strategy compared to traditional loss functions and other FL approaches. The codes can be found at: this https URL.

104. 【2602.01630】Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

链接https://arxiv.org/abs/2602.01630

作者:Bohan Zeng,Kaixin Zhu,Daili Hua,Bozhou Li,Chengzhuo Tong,Yuran Wang,Xinyi Huang,Yifan Dai,Zixiang Zhang,Yifan Yang,Zhou Liu,Hao Liang,Xiaochen Ma,Ruichuan An,Tianyi Bai,Hongcheng Gao,Junbo Niu,Yang Shi,Xinlong Chen,Yue Ding,Minglei Shi,Kai Zeng,Yiwen Tang,Yuanxing Zhang,Pengfei Wan,Xintao Wang,Wentao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enhance large models, aiming to enhance, critical frontier, enhance large, physical dynamics

备注: 13 pages, 4 figures

点击查看摘要

Abstract:World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, current research landscape remains fragmented, with approaches predominantly focused on injecting world knowledge into isolated tasks, such as visual prediction, 3D estimation, or symbol grounding, rather than establishing a unified definition or framework. While these task-specific integrations yield performance gains, they often lack the systematic coherence required for holistic world understanding. In this paper, we analyze the limitations of such fragmented approaches and propose a unified design specification for world models. We suggest that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to provide a structured perspective to guide future research toward more general, robust, and principled models of the world.

105. 【2602.01624】PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

链接https://arxiv.org/abs/2602.01624

作者:Minh-Quan Le,Gaurav Mittal,Cheng Zhao,David Gu,Dimitris Samaras,Mei Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high visual quality, PISCES, aims to synthesize, Dual Optimal Transport, quality

备注

点击查看摘要

Abstract:Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present $\texttt{PISCES}$, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, $\texttt{PISCES}$ uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, $\texttt{PISCES}$ is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that $\texttt{PISCES}$ outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

106. 【2602.01623】Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

链接https://arxiv.org/abs/2602.01623

作者:Susan Liang,Chao Huang,Filippos Bellos,Yolo Yunlong Tang,Qianxiang Shen,Jing Bi,Luchuan Song,Zeliang Zhang,Jason Corso,Chenliang Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:synchronized audio directly, produce high-fidelity videos, produce high-fidelity, textual prompt, Sora

备注

点击查看摘要

Abstract:State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.

107. 【2602.01609】oken Pruning for In-Context Generation in Diffusion Transformers

链接https://arxiv.org/abs/2602.01609

作者:Junqing Lin,Xingyu Zheng,Pei Cheng,Bin Fu,Jingwei Sun,Guangzhong Sun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enhances Diffusion Transformers, significantly enhances Diffusion, generation significantly enhances, Diffusion Transformers, enabling controllable

备注: 20 pages

点击查看摘要

Abstract:In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30\% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.

108. 【2602.01594】UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception

链接https://arxiv.org/abs/2602.01594

作者:Wenzhuo Liu,Qiannan Guo,Zhen Wang,Wenshuo Wang,Lei Yang,Yicheng Qiao,Lening Wang,Zhiwei Li,Chen Lv,Shanghang Zhang,Junqiang Xi,Huaping Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Advanced Driver Assistance, Driver Assistance Systems, Assistance Systems, understand human driver, impair system performance

备注

点击查看摘要

Abstract:Advanced Driver Assistance Systems (ADAS) need to understand human driver behavior while perceiving their navigation context, but jointly learning these heterogeneous tasks would cause inter-task negative transfer and impair system performance. Here, we propose a Unified and Versatile Multimodal Multi-Task Learning (UV-M3TL) framework to simultaneously recognize driver behavior, driver emotion, vehicle behavior, and traffic context, while mitigating inter-task negative transfer. Our framework incorporates two core components: dual-branch spatial channel multimodal embedding (DB-SCME) and adaptive feature-decoupled multi-task loss (AFD-Loss). DB-SCME enhances cross-task knowledge transfer while mitigating task conflicts by employing a dual-branch structure to explicitly model salient task-shared and task-specific features. AFD-Loss improves the stability of joint optimization while guiding the model to learn diverse multi-task representations by introducing an adaptive weighting mechanism based on learning dynamics and feature decoupling constraints. We evaluate our method on the AIDE dataset, and the experimental results demonstrate that UV-M3TL achieves state-of-the-art performance across all four tasks. To further prove the versatility, we evaluate UV-M3TL on additional public multi-task perception benchmarks (BDD100K, CityScapes, NYUD-v2, and PASCAL-Context), where it consistently delivers strong performance across diverse task combinations, attaining state-of-the-art results on most tasks.

109. 【2602.01593】Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework

链接https://arxiv.org/abs/2602.01593

作者:Wenzhuo Zhao,Keren Fu,Jiahao He,Xiaohong Liu,Qijun Zhao,Guangtao Zhai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:convolutional neural networks, complexity of Transformers, limited receptive fields, quadratic computational complexity, salient object detection

备注

点击查看摘要

Abstract:Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD, and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. As one step further, to avoid the "task-specific" problem as in previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components that collaboratively tackle challenges encountered in input of arbitrary modalities and continual adaptation are investigated. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost, whereas Samba+ achieves even superior results on these tasks and datasets by using a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.

110. 【2602.01591】Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages

链接https://arxiv.org/abs/2602.01591

作者:Zhixiong Yue,Zixuan Ni,Feiyang Ye,Jinshan Zhang,Sheng Shen,Zhenpeng Mi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, flow matching models, flow matching, flow matching text, enhanced human preference

备注

点击查看摘要

Abstract:Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few step text to image generators. However, existing RL based approaches for flow matching models typically rely on numerous denoising steps, while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature Annealed Few step Sampling with Group Relative Policy Optimization (TAFS GRPO), a novel framework for training flow matching text to image models into efficient few step generators well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one step samples. By repeatedly annealing the model's sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step aware advantage integration mechanism combines the GRPO to avoid the need for the differentiable of reward function and provide dense and step specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS GRPO achieves strong performance in few step text to image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be available to facilitate further research.

111. 【2602.01589】Genus-0 Surface Parameterization using Spherical Beltrami Differentials

链接https://arxiv.org/abs/2602.01589

作者:Zhehao Xu,Lok Ming Lui

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Algebraic Geometry (math.AG)

关键词:imaging science, Spherical surface parameterization, fundamental tool, tool in geometry, geometry processing

备注

点击查看摘要

Abstract:Spherical surface parameterization is a fundamental tool in geometry processing and imaging science. For a genus-0 closed surface, many efficient algorithms can map the surface to the sphere; consequently, a broad class of task-driven genus-0 mapping problems can be reduced to constructing a high-quality spherical self-map. However, existing approaches often face a trade-off between satisfying task objectives (e.g., landmark or feature alignment), maintaining bijectivity, and controlling geometric distortion. We introduce the Spherical Beltrami Differential (SBD), a two-chart representation of quasiconformal self-maps of the sphere, and establish its correspondence with spherical homeomorphisms up to conformal automorphisms. Building on the Spectral Beltrami Network (SBN), we propose a neural optimization framework BOOST that optimizes two Beltrami fields on hemispherical stereographic charts and enforces global consistency through explicit seam-aware constraints. Experiments on large-deformation landmark matching and intensity-based spherical registration demonstrate the effectiveness of our proposed framework. We further apply the method to brain cortical surface registration, aligning sulcal landmarks and jointly matching cortical sulci depth maps, showing improved task fidelity with controlled distortion and robust bijective behavior.

112. 【2602.01586】HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation

链接https://arxiv.org/abs/2602.01586

作者:Wencan Cheng,Gim Hee Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:augmented reality, locations is crucial, human-computer interaction applications, hand pose estimation, human hand keypoint

备注: AAAI accepted

点击查看摘要

Abstract:3D hand pose estimation that involves accurate estimation of 3D human hand keypoint locations is crucial for many human-computer interaction applications such as augmented reality. However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. Our HandMCM is a novel method based on the powerful state space model (Mamba). By incorporating modules for local information injection/filtering and correspondence modeling, the proposed correspondence Mamba effectively learns the highly dynamic kinematic topology of keypoints across various occlusion scenarios. Moreover, by integrating multi-modal image features, we enhance the robustness and representational capacity of the input, leading to more accurate hand pose estimation. Empirical evaluations on three benchmark datasets demonstrate that our model significantly outperforms current state-of-the-art methods, particularly in challenging scenarios involving severe occlusions. These results highlight the potential of our approach to advance the accuracy and reliability of 3D hand pose estimation in practical applications.

113. 【2602.01576】Generative Visual Code Mobile World Models

链接https://arxiv.org/abs/2602.01576

作者:Woosung Koh,Sungjun Han,Segyu Lee,Se-Young Yun,Jamin Shin

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Graphical User Interface, Mobile Graphical User, User Interface, Graphical User, Mobile Graphical

备注: Pre-print (technical report)

点击查看摘要

Abstract:Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs in precise text rendering led to their reliance on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

114. 【2602.01574】SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models

链接https://arxiv.org/abs/2602.01574

作者:Haobo Wang,Weiqi Luo,Xiaojun Jia,Xiaochun Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large vision-language models, transfer-based adversarial perturbations, Large vision-language, black-box VLM outputs, manipulate black-box VLM

备注

点击查看摘要

Abstract:Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations, enabling attackers to optimize on surrogate models and manipulate black-box VLM outputs. Prior targeted transfer attacks often overfit surrogate-specific embedding space by relying on a single reference and emphasizing final-layer alignment, which underutilizes intermediate semantics and degrades transfer across heterogeneous VLMs. To address this, we propose SGHA-Attack, a Semantic-Guided Hierarchical Alignment framework that adopts multiple target references and enforces intermediate-layer consistency. Concretely, we generate a visually grounded reference pool by sampling a frozen text-to-image model conditioned on the target prompt, and then carefully select the Top-K most semantically relevant anchors under the surrogate to form a weighted mixture for stable optimization guidance. Building on these anchors, SGHA-Attack injects target semantics throughout the feature hierarchy by aligning intermediate visual representations at both global and spatial granularities across multiple depths, and by synchronizing intermediate visual and textual features in a shared latent subspace to provide early cross-modal supervision before the final projection. Extensive experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods and remains robust under preprocessing and purification defenses.

115. 【2602.01570】One-Step Diffusion for Perceptual Image Compression

链接https://arxiv.org/abs/2602.01570

作者:Yiwen Jia,Hao Wei,Yanhui Zhou,Chenyang Ge

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved notable progress, delivering high perceptual, notable progress, delivering high, low bitrates

备注

点击查看摘要

Abstract:Diffusion-based image compression methods have achieved notable progress, delivering high perceptual quality at low bitrates. However, their practical deployment is hindered by significant inference latency and heavy computational overhead, primarily due to the large number of denoising steps required during decoding. To address this problem, we propose a diffusion-based image compression method that requires only a single-step diffusion process, significantly improving inference speed. To enhance the perceptual quality of reconstructed images, we introduce a discriminator that operates on compact feature representations instead of raw pixels, leveraging the fact that features better capture high-level texture and structural details. Experimental results show that our method delivers comparable compression performance while offering a 46$\times$ faster inference speed compared to recent diffusion-based approaches. The source code and models are available at this https URL.

116. 【2602.01561】Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd

链接https://arxiv.org/abs/2602.01561

作者:Yejin Son,Saejin Kim,Dongjun Min,Younjae Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multimodal contexts remains, artificial intelligence, Commonsense reasoning, contexts remains, remains a foundational

备注: 24 pages

点击查看摘要

Abstract:Commonsense reasoning in multimodal contexts remains a foundational challenge in artificial intelligence. We introduce Multimodal UNcommonsense(MUN), a benchmark designed to evaluate models' ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. To support this task, we propose a retrieval-based in-context learning (R-ICL) framework that transfers reasoning capabilities from larger models to smaller ones without additional training. Leveraging a novel Multimodal Ensemble Retriever (MER), our method identifies semantically relevant exemplars even when image and text pairs are deliberately discordant. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings. MUN opens new directions for evaluating and improving visual-language models' robustness and adaptability in real-world, culturally diverse, and non-prototypical scenarios.

117. 【2602.01559】Combined Flicker-banding and Moire Removal for Screen-Captured Images

链接https://arxiv.org/abs/2602.01559

作者:Libo Zhu,Zihan Zhou,Zhiyi Zhou,Yiyang Qu,Weihang Zhang,Keyu Shi,Yifan Fu,Yulun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Capturing display screens, significant visual quality, Capturing display, severe degradations caused, visual quality degradation

备注

点击查看摘要

Abstract:Capturing display screens with mobile devices has become increasingly common, yet the resulting images often suffer from severe degradations caused by the coexistence of moiré patterns and flicker-banding, leading to significant visual quality degradation. Due to the strong coupling of these two artifacts in real imaging processes, existing methods designed for single degradations fail to generalize to such compound scenarios. In this paper, we present the first systematic study on joint removal of moiré patterns and flicker-banding in screen-captured images, and propose a unified restoration framework, named CLEAR. To support this task, we construct a large-scale dataset containing both moiré patterns and flicker-banding, and introduce an ISP-based flicker simulation pipeline to stabilize model training and expand the degradation distribution. Furthermore, we design a frequency-domain decomposition and re-composition module together with a trajectory alignment loss to enhance the modeling of compound artifacts. Extensive experiments demonstrate that the proposed method consistently. outperforms existing image restoration approaches across multiple evaluation metrics, validating its effectiveness in complex real-world scenarios.

118. 【2602.01554】InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

链接https://arxiv.org/abs/2602.01554

作者:Lv Tang,Tianyi Zheng,Bo Li,Xingyu Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, maps visual inputs, multimodal large language, understanding and generation, language models

备注

点击查看摘要

Abstract:Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.

119. 【2602.01541】oward Cognitive Supersensing in Multimodal Large Language Model

链接https://arxiv.org/abs/2602.01541

作者:Boyi Li,Yifan Shen,Yuanzhe Liu,Yifan Xu,Jiateng Liu,Xinzhuo Li,Zhengyuan Li,Jingyuan Zhu,Yunhan Zhong,Fangzhou Lan,Jianguo Cao,James M. Rehg,Heng Ji,Ismini Lourentzou,Xu Cao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Multimodal Large, Large Language Models, problems remains limited, achieved remarkable success

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.

120. 【2602.01540】FSCA-Net: Feature-Separated Cross-Attention Network for Robust Multi-Dataset Training

链接https://arxiv.org/abs/2602.01540

作者:Yuehai Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:smart city management, traffic regulation, public safety, city management, plays a vital

备注

点击查看摘要

Abstract:Crowd counting plays a vital role in public safety, traffic regulation, and smart city management. However, despite the impressive progress achieved by CNN- and Transformer-based models, their performance often deteriorates when applied across diverse environments due to severe domain discrepancies. Direct joint training on multiple datasets, which intuitively should enhance generalization, instead results in negative transfer, as shared and domain-specific representations become entangled. To address this challenge, we propose the Feature Separation and Cross-Attention Network FSCA-Net, a unified framework that explicitly disentangles feature representations into domain-invariant and domain-specific components. A novel cross-attention fusion module adaptively models interactions between these components, ensuring effective knowledge transfer while preserving dataset-specific discriminability. Furthermore, a mutual information optimization objective is introduced to maximize consistency among domain-invariant features and minimize redundancy among domain-specific ones, promoting complementary shared-private representations. Extensive experiments on multiple crowd counting benchmarks demonstrate that FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization, providing a robust and scalable solution for real-world crowd analysis.

121. 【2602.01538】Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

链接https://arxiv.org/abs/2602.01538

作者:Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:talking avatars, grounded human-object, Aware Generation Module, perception, grounded human-object interaction

备注

点击查看摘要

Abstract:Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: this https URL

122. 【2602.01536】UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning

链接https://arxiv.org/abs/2602.01536

作者:Shuai Liu,Siheng Ren,Xiaoyao Zhu,Quanmin Liang,Zefeng Li,Qiang Li,Xin Hu,Kai Huang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Achieving reliable, complex driving environments, driving environments requires, reliable and efficient, environments requires

备注

点击查看摘要

Abstract:Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene's geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space, enabling consistent reasoning across perception, prediction, and planning. Specifically, a joint reconstruction pathway learns to recover the scene's structure, including geometry and visual texture, while a collaborative generation framework leverages a conditional diffusion transformer to forecast future world evolution within the latent space. Furthermore, we show that our UniDWM can be deemed as a variation of VAE, which provides theoretical guidance for the multifaceted representation learning. Extensive experiments demonstrate the effectiveness of UniDWM in trajectory planning, 4D reconstruction and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence. The code will be publicly available at this https URL.

123. 【2602.01533】Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units

链接https://arxiv.org/abs/2602.01533

作者:Zhe Ling,Sicheng Yu,Danyu Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Online handwritten character, generally provide higher, provide higher accuracy, leverages stroke order, Online handwritten

备注

点击查看摘要

Abstract:Online handwritten character recognition leverages stroke order and dynamic features, which generally provide higher accuracy and robustness compared with offline recognition. However, in practical applications, rotational deformations can disrupt the spatial layout of strokes, substantially reducing recognition accuracy. Extracting rotation-invariant features therefore remains a challenging open problem. In this work, we employ the Sliding Window Path Signature (SW-PS) to capture local structural features of characters, and introduce the lightweight Linear Recurrent Units (LRU) as the classifier. The LRU combine the fast incremental processing capability of recurrent neural networks (RNN) with the efficient parallel training of state space models (SSM), while reliably modelling dynamic stroke characteristics. We conducted recognition experiments with random rotation angle up to $\pm 180^{\circ}$ on three subsets of the CASIA-OLHWDB1.1 dataset: digits, English upper letters, and Chinese radicals. The accuracies achieved after ensemble learning were $99.62\%$, $96.67\%$, and $94.33\%$, respectively. Experimental results demonstrate that the proposed SW-PS+LRU framework consistently surpasses competing models in both convergence speed and test accuracy.

124. 【2602.01530】Preserving Localized Patch Semantics in VLMs

链接https://arxiv.org/abs/2602.01530

作者:Parsa Esmaeilkhani,Longin Jan Latecki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Logit Lens, Logit Lens Loss, LLM answers, proposed Logit Lens, renders Logit Lens

备注

点击查看摘要

Abstract:Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the textual concepts that describe their image regions (e.g., patches containing a cat with the word "cat"), without requiring any architectural modification or large-scale training. This way, LLL constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.

125. 【2602.01527】oward a Machine Bertin: Why Visualization Needs Design Principles for Machine Cognition

链接https://arxiv.org/abs/2602.01527

作者:Brian Keith-Norambuena

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:preattentive processing rules, design knowledge-effectiveness rankings, knowledge-effectiveness rankings, preattentive processing, processing rules

备注: Preprint submitted to IEEE TVCG on February 2026

点击查看摘要

Abstract:Visualization's design knowledge-effectiveness rankings, encoding guidelines, color models, preattentive processing rules -- derives from six decades of psychophysical studies of human vision. Yet vision-language models (VLMs) increasingly consume chart images in automated analysis pipelines, and a growing body of benchmark evidence indicates that this human-centered knowledge base does not straightforwardly transfer to machine audiences. Machines exhibit different encoding performance patterns, process images through patch-based tokenization rather than holistic perception, and fail on design patterns that pose no difficulty for humans-while occasionally succeeding where humans struggle. Current approaches address this gap primarily by bypassing vision entirely, converting charts to data tables or structured text. We argue that this response forecloses a more fundamental question: what visual representations would actually serve machine cognition well? This paper makes the case that the visualization field needs to investigate machine-oriented visual design as a distinct research problem. We synthesize evidence from VLM benchmarks, visual reasoning research, and visualization literacy studies to show that the human-machine perceptual divergence is qualitative, not merely quantitative, and critically examine the prevailing bypassing approach. We propose a conceptual distinction between human-oriented and machine-oriented visualization-not as an engineering architecture but as a recognition that different audiences may require fundamentally different design foundations-and outline a research agenda for developing the empirical foundations the field currently lacks: the beginnings of a "machine Bertin" to complement the human-centered knowledge the field already possesses.

126. 【2602.01522】When Is Rank-1 Enough? Geometry-Guided Initialization for Parameter-Efficient Fine-Tuning

链接https://arxiv.org/abs/2602.01522

作者:Haoran Zhao,Soyeon Caren Han,Eduard Hovy

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, adapt multimodal large, multimodal large language, Parameter-efficient fine-tuning, extremely low-rank settings

备注

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings -- especially rank-1 LoRA -- are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant "gap" direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.

127. 【2602.01501】reeLoc: 6-DoF LiDAR Global Localization in Forests via Inter-Tree Geometric Matching

链接https://arxiv.org/abs/2602.01501

作者:Minwoo Jung,Nived Chebrolu,Lucas Carvalho de Lima,Haedam Oh,Maurice Fallon,Ayoung Kim

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Reliable localization, measurements are repetitive, structurally complex, crucial for navigation, degraded and LiDAR

备注: An 8-page paper with 7 tables and 8 figures, accepted to ICRA 2026

点击查看摘要

Abstract:Reliable localization is crucial for navigation in forests, where GPS is often degraded and LiDAR measurements are repetitive, occluded, and structurally complex. These conditions weaken the assumptions of traditional urban-centric localization methods, which assume that consistent features arise from unique structural patterns, necessitating forest-centric solutions to achieve robustness in these environments. To address these challenges, we propose TreeLoc, a LiDAR-based global localization framework for forests that handles place recognition and 6-DoF pose estimation. We represent scenes using tree stems and their Diameter at Breast Height (DBH), which are aligned to a common reference frame via their axes and summarized using the tree distribution histogram (TDH) for coarse matching, followed by fine matching with a 2D triangle descriptor. Finally, pose estimation is achieved through a two-step geometric verification. On diverse forest benchmarks, TreeLoc outperforms baselines, achieving precise localization. Ablation studies validate the contribution of each component. We also propose applications for long-term forest management using descriptors from a compact global tree database. TreeLoc is open-sourced for the robotics community at this https URL.

128. 【2602.01459】Understanding vision transformer robustness through the lens of out-of-distribution detection

链接https://arxiv.org/abs/2602.01459

作者:Joey Kuang,Alexander Wong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:shown remarkable performance, shown remarkable, accessible and real-time, remarkable performance, Vision transformers

备注: Accepted to JCVIS 2025

点击查看摘要

Abstract:Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use is still challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low precision issues mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may provide insight on quantization attributes by exploring out-of-distribution (OOD) situations. We investigate the behaviour of quantized small-variant popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses show the initial instabilities of 4-bit models, particularly of those trained on the larger ImageNet-22k, as the strongest FP32 model, DeiT3, sharply drop 17% from quantization error to be one of the weakest 4-bit models. While ViT shows reasonable quantization robustness for ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k respectively experienced a 15.0% and 19.2% average quantization delta in AUPR-out between full precision to 4-bit while their ImageNet-1k-only counterparts experienced a 9.5% and 12.0% delta. Overall, our results suggest pretraining on large scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.

129. 【2602.01456】Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations

链接https://arxiv.org/abs/2602.01456

作者:Yilun Kuang,Yash Dagade,Tim G. J. Rudner,Randall Balestriero,Yann LeCun

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Joint-Embedding Predictive Architectures, Predictive Architectures, Joint-Embedding Predictive, admit projection-based distribution, projection-based distribution matching

备注

点击查看摘要

Abstract:Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while preserving maximum-entropy up to rescaling under expected $\ell_p$ norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.

130. 【2602.01452】Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles

链接https://arxiv.org/abs/2602.01452

作者:Penghao Deng,Jidong J. Yang,Jiachen Bian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:developing next-generation advanced, next-generation advanced driver-assistance, improving road safety, advanced driver-assistance systems, attention during driving

备注: 21 pages, 15 figures, 3 tables

点击查看摘要

Abstract:Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle's front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a "part-versus-whole" semantic gap that led to large failure in recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.

131. 【2602.01435】BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images

链接https://arxiv.org/abs/2602.01435

作者:Soumyaroop Nandi,Prem Natarajan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:State Space Model, State Space, inspired by State, Space Model, SSM

备注

点击查看摘要

Abstract:We propose BioTamperNet, a novel framework for detecting duplicated regions in tampered biomedical images, leveraging affinity-guided attention inspired by State Space Model (SSM) approximations. Existing forensic models, primarily trained on natural images, often underperform on biomedical data where subtle manipulations can compromise experimental validity. To address this, BioTamperNet introduces an affinity-guided self-attention module to capture intra-image similarities and an affinity-guided cross-attention module to model cross-image correspondences. Our design integrates lightweight SSM-inspired linear attention mechanisms to enable efficient, fine-grained localization. Trained end-to-end, BioTamperNet simultaneously identifies tampered regions and their source counterparts. Extensive experiments on the benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions. Code - this https URL

132. 【2602.01418】Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

链接https://arxiv.org/abs/2602.01418

作者:Christoffer Koo Øhrstrøm,Rafael I. Cabral Muchacho,Yifei Dong,Filippos Moumtzidellis,Ronja Güldenring,Florian T. Pokorny,Lazaros Nalpantidis

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:propose Parabolic Position, Parabolic Position Encoding, propose Parabolic, Parabolic Position, attention-based architectures

备注

点击查看摘要

Abstract:We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens-such as images, point clouds, videos, or event camera streams-our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at this https URL.

133. 【2602.01391】Stronger Semantic Encoders Can Harm Relighting Performance: Probing Visual Priors via Augmented Latent Intrinsics

链接https://arxiv.org/abs/2602.01391

作者:Xiaoyan Xing,Xiao Zhang,Sezer Karaoglu,Theo Gevers,Anand Bhattad

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:disentangle scene properties, properties from illumination, disentangle scene, scene properties, relighting requires representations

备注: Project page: https:\\ [this http URL](http://augmented-latent-intrinsics.github.io)

点击查看摘要

Abstract:Image-to-image relighting requires representations that disentangle scene properties from illumination. Recent methods rely on latent intrinsic representations but remain under-constrained and often fail on challenging materials such as metal and glass. A natural hypothesis is that stronger pretrained visual priors should resolve these failures. We find the opposite: features from top-performing semantic encoders often degrade relighting quality, revealing a fundamental trade-off between semantic abstraction and photometric fidelity. We study this trade-off and introduce Augmented Latent Intrinsics (ALI), which balances semantic context and dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, together with a self-supervised refinement strategy to mitigate the scarcity of paired real-world data. Trained only on unlabeled real-world image pairs and paired with a dense, pixel-aligned visual prior, ALI achieves strong improvements in relighting, with the largest gains on complex, specular materials. Project page: https:\\this http URL

134. 【2602.01382】PromptRL: Prompt Matters in RL for Flow-Based Image Generation

链接https://arxiv.org/abs/2602.01382

作者:Fu-Yun Wang,Han Zhang,Michael Gharbi,Hongsheng Li,Taesung Park

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Flow matching models, critical post-training strategy, Flow matching, reward objectives, critical post-training

备注

点击查看摘要

Abstract:Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at this https URL.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2602.01382 [cs.CV]

(or
arXiv:2602.01382v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.01382

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
135. 【2602.01370】PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

链接https://arxiv.org/abs/2602.01370

作者:Leonardo Brusini,Cristian Sbrolli,Eugenio Lomurno,Toshihiko Yamasaki,Matteo Matteucci

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:single generative backbone, methods typically rely, generator-specific spectral biases, Synthetic data offers, introduces generator-specific spectral

备注

点击查看摘要

Abstract:Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and on the SugarCrepe++ compositionality benchmark (+9.1%). These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.

136. 【2602.01369】Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts

链接https://arxiv.org/abs/2602.01369

作者:Songping Wang,Qinglong Liu,Yueming Lyu,Ning Li,Ziwen He,Caifeng Shan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated strong performance, video understanding tasks, robustness remains underexplored, Temporal Lipschitz-Guided Attacks, understanding tasks

备注

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has demonstrated strong performance in video understanding tasks, yet its adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles' Heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT). J-TLAT performs joint training to further defend against collaborative weaknesses, enhancing component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse datasets and architectures, effectively mitigating both the independent and collaborative weaknesses of MoE.

137. 【2602.01352】2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation

链接https://arxiv.org/abs/2602.01352

作者:Xingzu Zhan,Chen Xie,Honghang Chen,Yixun Lin,Xiaochun Mai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:humanoid robotic interaction, attracted increasing attention, human motion sequences, converts motion language, motion language descriptions

备注: 8 pages,5 figures

点击查看摘要

Abstract:Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields, such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on HumanML3D and KIT-ML datasets have been conducted, confirming the effectiveness of our approach, achieving an FID of 0.068 and consistent gains on all other metrics.

138. 【2602.01345】Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis

链接https://arxiv.org/abs/2602.01345

作者:Yu Zhang,Jingyi Liu,Feng Liu,Duoqian Miao,Qi Zhang,Kexue Fu,Changwei Wang,Longbing Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual AutoRegressive modeling, substantial computational cost, computational cost due, Visual AutoRegressive, token count involved

备注: 11 pages, 8 figures

点击查看摘要

Abstract:Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by online identifying the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.

139. 【2602.01340】MTC-VAE: Multi-Level Temporal Compression with Content Awareness

链接https://arxiv.org/abs/2602.01340

作者:Yubo Dong,Linchao Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Latent Video Diffusion, compact latent representations, Variational Autoencoders, continuous Variational Autoencoders, Video Diffusion Models

备注

点击查看摘要

Abstract:Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous Variational Autoencoders (VAEs), achieving higher compression rates is desirable; yet, the efficiency notably declines when extra sampling layers are added without expanding the dimensions of hidden channels. In this paper, we present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression, providing a straightforward and minimal fine-tuning approach to counteract performance decline at elevated compression this http URL, we examine how varying compression levels impact model performance over video segments with diverse characteristics, offering empirical evidence on the effectiveness of our proposed approach. We also investigate the integration of our multi-level temporal compression VAE with diffusion-based generative models, DiT, highlighting successful concurrent training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.

140. 【2602.01335】Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

链接https://arxiv.org/abs/2602.01335

作者:Yu Xu,Yuxin Zhang,Juan Cao,Lin Gao,Chunyu Wang,Oliver Deussen,Tong-Yee Lee,Fan Tang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:transform abstract concepts, impactful visual rhetoric, employing cross-domain semantic, cross-domain semantic fusion, visual metaphor constitutes

备注: 11 pages, 10 figures

点击查看摘要

Abstract:A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar ("G"). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.

141. 【2602.01334】What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

链接https://arxiv.org/abs/2602.01334

作者:Yan Ma,Weiyu Zhang,Tianle Li,Linge Du,Xuyang Shen,Pengfei Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:URL introduce MED, http URL introduce, achieves strong performance, strong performance gains, equip vision-language models

备注: code: [this https URL](https://github.com/GAIR-NLP/Med)

点击查看摘要

Abstract:Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic this http URL introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.

142. 【2602.01329】FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching

链接https://arxiv.org/abs/2602.01329

作者:Divya Jyoti Bajpai,Shubham Agarwal,Apoorv Saxena,Kuldeep Kulkarni,Subrata Mitra,Manjesh Kumar Hanawal

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Flow Matching, high-quality visual generation, recently emerged, powerful approach, approach for high-quality

备注: Accepted at International Conference on Learning Representations (ICLR 2026)

点击查看摘要

Abstract:Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, their prohibitively slow inference due to a large number of denoising steps limits their potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.

143. 【2602.01306】DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling

链接https://arxiv.org/abs/2602.01306

作者:Ayushman Sarkar,Zhenyu Yu,Mohd Yamani Idna Idris

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Maintaining visual, key challenge, Maintaining, Abstract, storytelling

备注

点击查看摘要

Abstract:Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: this https URL

144. 【2602.01305】StoryState: Agent-Based State Control for Consistent and Editable Storybooks

链接https://arxiv.org/abs/2602.01305

作者:Ayushman Sarkar,Zhenyu Yu,Wei Tang,Chu Chen,Kangning Cui,Mohd Yamani Idna Idris

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large multimodal models, enabled one-click storybook, Large multimodal, one-click storybook generation, multi-page illustrated story

备注

点击查看摘要

Abstract:Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at this https URL

145. 【2602.01303】ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

链接https://arxiv.org/abs/2602.01303

作者:Ayushman Sarkar,Zhenyu Yu,Chu Chen,Wei Tang,Kangning Cui,Mohd Yamani Idna Idris

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generating coherent visual, Generating coherent, coherent visual stories, visual stories requires, requires maintaining subject

备注

点击查看摘要

Abstract:Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: this https URL

146. 【2602.01298】Interaction-Consistent Object Removal via MLLM-Based Reasoning

链接https://arxiv.org/abs/2602.01298

作者:Ching-Kai Huang,Wen-Chieh Lin,Yan-Cen Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Image-based object removal, Image-based object, object removal, result semantically inconsistent, semantically inconsistent

备注

点击查看摘要

Abstract:Image-based object removal often erases only the named target, leaving behind interaction evidence that renders the result semantically inconsistent. We formalize this problem as Interaction-Consistent Object Removal (ICOR), which requires removing not only the target object but also associated interaction elements, such as lighting-dependent effects, physically connected objects, targetproduced elements, and contextually linked objects. To address this task, we propose Reasoning-Enhanced Object Removal with MLLM (REORM), a reasoningenhanced object removal framework that leverages multimodal large language models to infer which elements must be jointly removed. REORM features a modular design that integrates MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, along with a local-deployment variant that supports accurate editing under limited resources. To support evaluation, we introduce ICOREval, a benchmark consisting of instruction-driven removals with rich interaction dependencies. On ICOREval, REORM outperforms state-of-the-art image editing systems, demonstrating its effectiveness in producing interactionconsistent results.

147. 【2602.01296】Interacted Planes Reveal 3D Line Mapping

链接https://arxiv.org/abs/2602.01296

作者:Zeran Ke,Bin Tan,Gui-Song Xia,Yujun Shen,Nan Xue

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multi-view RGB images, multi-view RGB, RGB images, line mapping, line

备注: submitted to TPAMI

点击查看摘要

Abstract:3D line mapping from multi-view RGB images provides a compact and structured visual representation of scenes. We study the problem from a physical and topological perspective: a 3D line most naturally emerges as the edge of a finite 3D planar patch. We present LiP-Map, a line-plane joint optimization framework that explicitly models learnable line and planar primitives. This coupling enables accurate and detailed 3D line mapping while maintaining strong efficiency (typically completing a reconstruction in 3 to 5 minutes per scene). LiP-Map pioneers the integration of planar topology into 3D line mapping, not by imposing pairwise coplanarity constraints but by explicitly constructing interactions between plane and line primitives, thus offering a principled route toward structured reconstruction in man-made environments. On more than 100 scenes from ScanNetV2, ScanNet++, Hypersim, 7Scenes, and Tanks\Temple, LiP-Map improves both accuracy and completeness over state-of-the-art methods. Beyond line mapping quality, LiP-Map significantly advances line-assisted visual localization, establishing strong performance on 7Scenes. Our code is released at this https URL for reproducible research.

148. 【2602.01289】Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models

链接https://arxiv.org/abs/2602.01289

作者:Dung Anh Hoang,Cuong Pham anh Trung Le,Jianfei Cai,Toan Do

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:shown remarkable performance, Diffusion models, image synthesis, real image, Gaussian distribution

备注

点击查看摘要

Abstract:Diffusion models have shown remarkable performance in image synthesis by progressively estimating a smooth transition from a Gaussian distribution of noise to a real image. Unfortunately, their practical deployment is limited by slow inference speed, high memory usage, and the computational demands of the noise estimation process. Post-training quantization (PTQ) emerges as a promising solution to accelerate sampling and reduce memory overhead for diffusion models. Existing PTQ methods for diffusion models typically apply uniform weights to calibration samples across timesteps, which is sub-optimal since data at different timesteps may contribute differently to the diffusion process. Additionally, due to varying activation distributions and gradients across timesteps, a uniform quantization approach is sub-optimal. Each timestep requires a different gradient direction for optimal quantization, and treating them equally can lead to conflicting gradients that degrade performance. In this paper, we propose a novel PTQ method that addresses these challenges by assigning appropriate weights to calibration samples. Specifically, our approach learns to assign optimal weights to calibration samples to align the quantized model's gradients across timesteps, facilitating the quantization process. Extensive experiments on CIFAR-10, LSUN-Bedrooms, and ImageNet demonstrate the superiority of our method compared to other PTQ methods for diffusion models.

149. 【2602.01284】Seeing, Hearing, and Knowing Together: Multimodal Strategies in Deepfake Videos Detection

链接https://arxiv.org/abs/2602.01284

作者:Chen Chen,Dion Hoe-Lian Goh

类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:increasingly difficult, media literacy interventions, deepfake videos, literacy interventions, understanding the strategies

备注

点击查看摘要

Abstract:As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21 and 40, who judged real and deepfake videos, rated their confidence, and reported the cues they relied on across visual, audio, and knowledge strategies. Participants were more accurate with real videos than with deepfakes and showed lower expected calibration error for real content. Through association rule mining, we identified cue combinations that shaped performance. Visual appearance, vocal, and intuition often co-occurred for successful identifications, which highlights the importance of multimodal approaches in human detection. Our findings show which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use. Building on these insights can help people improve their identification skills and become more resilient to deceptive digital media.

150. 【2602.01283】Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

链接https://arxiv.org/abs/2602.01283

作者:Xianhui Zhang,Chengyu Xie,Linxia Zhu,Yonghui Yang,Weixiang Zhao,Zifeng Cheng,Cong Wang,Fei Shen,Tat-Seng Chua

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multilingual safety remains, Toggle, Multilingual safety, safety, shared safety neurons

备注

点击查看摘要

Abstract:Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model's general capabilities. The code and dataset will be available athttps://github.com/1518630367/SS-Neuron-Expansion.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.01283 [cs.CV]

(or
arXiv:2602.01283v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.01283

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Fei Shen [view email] [v1]
Sun, 1 Feb 2026 15:28:02 UTC (767 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons, by Xianhui Zhang and 8 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: cs.CV

prev

|
next

new
|
recent
| 2026-02

Change to browse by:

cs

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

151. 【2602.01278】DSFC-Net: A Dual-Encoder Spatial and Frequency Co-Awareness Network for Rural Road Extraction

链接https://arxiv.org/abs/2602.01278

作者:Zhengbo Zhang,Yihe Tian,Wanke Xia,Lin Chen,Yue Sun,Kun Ding,Ying Wang,Bing Xu,Shiming Xiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-resolution remote sensing, remote sensing imagery, sustainable development, high-resolution remote, remote sensing

备注

点击查看摘要

Abstract:Accurate extraction of rural roads from high-resolution remote sensing imagery is essential for infrastructure planning and sustainable development. However, this task presents unique challenges in rural settings due to several factors. These include high intra-class variability and low inter-class separability from diverse surface materials, frequent vegetation occlusions that disrupt spatial continuity, and narrow road widths that exacerbate detection difficulties. Existing methods, primarily optimized for structured urban environments, often underperform in these scenarios as they overlook such distinctive characteristics. To address these challenges, we propose DSFC-Net, a dual-encoder framework that synergistically fuses spatial and frequency-domain information. Specifically, a CNN branch is employed to capture fine-grained local road boundaries and short-range continuity, while a novel Spatial-Frequency Hybrid Transformer (SFT) is introduced to robustly model global topological dependencies against vegetation occlusions. Distinct from standard attention mechanisms that suffer from frequency bias, the SFT incorporates a Cross-Frequency Interaction Attention (CFIA) module that explicitly decouples high- and low-frequency information via a Laplacian Pyramid strategy. This design enables the dynamic interaction between spatial details and frequency-aware global contexts, effectively preserving the connectivity of narrow roads. Furthermore, a Channel Feature Fusion Module (CFFM) is proposed to bridge the two branches by adaptively recalibrating channel-wise feature responses, seamlessly integrating local textures with global semantics for accurate segmentation. Comprehensive experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets validate the superiority of DSFC-Net over state-of-the-art approaches.

152. 【2602.01277】F-Lane: Traffic Flow Module for Robust Lane Perception

链接https://arxiv.org/abs/2602.01277

作者:Yihan Xie,Han Xia,Zhen Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:provide insufficient cues, systems require robust, vision-based detection methods, detection methods suffer, methods suffer significant

备注: 9 pages, 7 figures, 7 tables

点击查看摘要

Abstract:Autonomous driving systems require robust lane perception capabilities, yet existing vision-based detection methods suffer significant performance degradation when visual sensors provide insufficient cues, such as in occluded or lane-missing scenarios. While some approaches incorporate high-definition maps as supplementary information, these solutions face challenges of high subscription costs and limited real-time performance. To address these limitations, we explore an innovative information source: traffic flow, which offers real-time capabilities without additional costs. This paper proposes a TrafficFlow-aware Lane perception Module (TFM) that effectively extracts real-time traffic flow features and seamlessly integrates them with existing lane perception algorithms. This solution originated from real-world autonomous driving conditions and was subsequently validated on open-source algorithms and datasets. Extensive experiments on four mainstream models and two public datasets (Nuscenes and OpenLaneV2) using standard evaluation metrics show that TFM consistently improves performance, achieving up to +4.1% mAP gain on the Nuscenes dataset.

153. 【2602.01273】Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

链接https://arxiv.org/abs/2602.01273

作者:Xun Zhang,Kaicheng Yang,Hongliang Lu,Haotong Qin,Yong Guo,Yulun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:heavy inference burden, inference burden hinders, generate high-quality textures, hinders real-world deployment, Real-World Image Super-Resolution

备注: Our code and models will be available at [this https URL](https://github.com/xunzhang1128/Q-DiT4SR)

点击查看摘要

Abstract:Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at this https URL.

154. 【2602.01268】OASIS-DC: Generalizable Depth Completion via Output-level Alignment of Sparse-Integrated Monocular Pseudo Depth

链接https://arxiv.org/abs/2602.01268

作者:Jaehyeon Cho,Jhonghyun An

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Recent monocular foundation, Recent monocular, zero-shot depth estimation, monocular foundation models, foundation models excel

备注: Accepted to ICRA 2026

点击查看摘要

Abstract:Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.

155. 【2602.01257】Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment

链接https://arxiv.org/abs/2602.01257

作者:Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gained significant attention, Point-based Text Refinement, Text Refinement, gained significant, significant attention

备注

点击查看摘要

Abstract:Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple pre-trained models. PMA then projects all features into a unified semantic space and leverages a point-level multimodal feature contrastive learning to reduce the gap between visual and linguistic modalities. Last, the enhanced multi-modal features are fed into the action detector for precise localization. Extensive experimental results on five widely used benchmarks demonstrate the favorable performance of our proposed framework compared to several state-of-the-art methods. Moreover, our computational overhead analysis shows that the framework can run on a single 24 GB RTX 3090 GPU, indicating its practicality and scalability.

156. 【2602.01219】MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-$k$ Activations

链接https://arxiv.org/abs/2602.01219

作者:Qishuai Wen,Zhiyuan Huang,Xianghan Meng,Wei He,Chun-Guang Li

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:two-layer fast-weight MLP, width equals sequence, equals sequence length, operator in Transformers, N-width MLP increases

备注

点击查看摘要

Abstract:The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.

157. 【2602.01212】SimpleGPT: Improving GPT via A Simple Normalization Strategy

链接https://arxiv.org/abs/2602.01212

作者:Marco Chen,Xianbiao Qi,Yelin He,Jiaquan Ye,Rong Xiao

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:revisit Transformer optimization, maximum tolerable learning, revisit Transformer, tolerable learning rate, Hessian matrix

备注: We propose SimpleGPT, a simple yet effective GPT model, and provide theoretical insights into its mathematical foundations. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B

点击查看摘要

Abstract:In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3$\times$-10$\times$ larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at this https URL.

158. 【2602.01200】Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis

链接https://arxiv.org/abs/2602.01200

作者:Haoran Lai,Zihang Jiang,Kun Zhang,Qingsong Yao,Rongsheng Wang,Zhiyang He,Xiaodong Tao,Wei Wei,Shaohua Kevin Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:superficial report patterns, overfit superficial report, interpretability-aware reward designs, volumetric medical imaging, reinforcement learning framework

备注

点击查看摘要

Abstract:Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92\% on CT-RATE and 44.99\% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.

159. 【2602.01194】EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting

链接https://arxiv.org/abs/2602.01194

作者:Hao Chen,Tao Han,Jie Zhang,Song Guo,Fenghua Ling,Lei Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:disaster preparedness, critical for socioeconomic, socioeconomic planning, planning and disaster, Efficient Multi-scale Transformer

备注

点击查看摘要

Abstract:Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by the issues of catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ an accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules.

160. 【2602.01193】Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation

链接https://arxiv.org/abs/2602.01193

作者:Shashini Nilukshi,Deshan Sumanathilaka

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Word Sense Disambiguation, traditional Word Sense, Visual Word Sense, Word Sense, Sense Disambiguation

备注: 2 figures, 2 Tables, Accepted at IEEE TIC 2026

点击查看摘要

Abstract:This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. It focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8\% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.

161. 【2602.01183】Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion

链接https://arxiv.org/abs/2602.01183

作者:Chunming He,Rihan Zhang,Fengyang Xiao,Dingming Zhang,Zhiwen Cao,Sina Farsiu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:gradually reinforcing perception, Biological learning proceeds, difficult tasks, gradually reinforcing, Biological learning

备注: 8 figures, 11 tables

点击查看摘要

Abstract:Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.

162. 【2602.01173】EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment

链接https://arxiv.org/abs/2602.01173

作者:Lancheng Gao,Ziheng Jia,Zixuan Xing,Wei Sun,Huiyu Duan,Guangtao Zhai,Xiongkuo Min

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human-computer interaction applications, advancing machine empathy, empowering diverse human-computer, diverse human-computer interaction, interaction applications

备注

点击查看摘要

Abstract:Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features $5$ analysis dimensions spanning $5$ distinct task categories, facilitating comprehensive interpretation. Specifically, we compile $1.2M$ question-answering (QA) pairs (EEmoDB-QA) from $125k$ images via automated generation, alongside a $36k$ dataset (EEmoDB-Assess) curated from $25k$ images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at this https URL.

163. 【2602.01163】Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models

链接https://arxiv.org/abs/2602.01163

作者:Chunliang Hua,Zeyuan Yang,Lei Zhang,Jiayang Sun,Fengwen Chen,Chunlan Zeng,Xiao Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Safe UAV emergency, Safe UAV, identifying flat terrain, UAV emergency landing, demands understanding complex

备注

点击查看摘要

Abstract:Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context-aware landing site assessment. Unlike local geometric methods, our approach employs a coarse-to-fine pipeline: first, a lightweight semantic segmentation module efficiently pre-screens candidate areas; second, a vision-language reasoning agent fuses visual features with Point-of-Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human-like, interpretable justifications, enhancing trust in automated decision-making. The benchmark dataset is publicly accessible at this https URL.

164. 【2602.01158】Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs

链接https://arxiv.org/abs/2602.01158

作者:Daniel Yezid Guarnizo Orjuela,Leonardo Scappatura,Veronica Di Gennaro,Riccardo Andrea Izzo,Gianluca Bardaro,Matteo Matteucci

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:generalist robotic manipulation, robotic manipulation, unifying perception, dominant paradigm, paradigm for generalist

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $\pi_{0.5}$ and SmolVLA, suffer catastrophic performance degradation, dropping from 90\% success rates to as low as 2\%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.

165. 【2602.01150】Statistical MIA: Rethinking Membership Inference Attack for Reliable Unlearning Auditing

链接https://arxiv.org/abs/2602.01150

作者:Jialong Sun,Zeming Wei,Jiaxuan Zou,Jiacheng Gong,Guanheng Wang,Chengyang Dong,Jialong Li,Bo Liu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

关键词:machine learning systems, learning systems, Membership Inference, essential for enforcing, Membership Inference Attacks

备注

点击查看摘要

Abstract:Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearning auditing, where samples that evade membership detection are often regarded as successfully forgotten. After carefully revisiting the reliability of MIA, we show that this assumption is flawed: failed membership inference does not imply true forgetting. We theoretically demonstrate that MIA-based auditing, when formulated as a binary classification problem, inevitably incurs statistical errors whose magnitude cannot be observed during the auditing process. This leads to overly optimistic evaluations of unlearning performance, while incurring substantial computational overhead due to shadow model training. To address these limitations, we propose Statistical Membership Inference Attack (SMIA), a novel training-free and highly effective auditing framework. SMIA directly compares the distributions of member and non-member data using statistical tests, eliminating the need for learned attack models. Moreover, SMIA outputs both a forgetting rate and a corresponding confidence interval, enabling quantified reliability of the auditing results. Extensive experiments show that SMIA provides more reliable auditing with significantly lower computational cost than existing MIA-based approaches. Notably, the theoretical guarantees and empirical effectiveness of SMIA suggest it as a new paradigm for reliable machine unlearning auditing.

166. 【2602.01127】Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis

链接https://arxiv.org/abs/2602.01127

作者:Matej Suchanek,Klara Janouskova,Ondrej Vasatko,Jiri Matas

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:powerful general-purpose representations, Visual-language models, Linear Discriminant Analysis, provide powerful general-purpose, exhibiting limited class

备注

点击查看摘要

Abstract:Visual-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports substantial compression by up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.01127 [cs.CV]

(or
arXiv:2602.01127v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.01127

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
167. 【2602.01118】LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions

链接https://arxiv.org/abs/2602.01118

作者:Jingjing Wang,Qirui Hu,Chong Bao,Yuke Zhu,Hujun Bao,Zhaopeng Cui,Guofeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Inverse rendering, digital twins, scenes is pivotal, pivotal for applications, applications like autonomous

备注

点击查看摘要

Abstract:Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction have not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, varying scales with street-level and aerial perspectives over 50K images, and rich properties such as depth, normal, material components, light and indirect light, etc. Besides, we leverage LightCity to benchmark three fundamental tasks in the urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research.

168. 【2602.01115】KAN We Flow? Advancing Robotic Manipulation with 3D Flow Matching via KAN RWKV

链接https://arxiv.org/abs/2602.01115

作者:Zhihao Chen,Yiyuan Ge,Ziyang Wang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion-based visuomotor policies, visuomotor policies excel, Diffusion-based visuomotor, modeling action distributions, resource-constrained robots

备注: Accepted By ICRA2026

点击查看摘要

Abstract:Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient, since recursively denoising from noise to policy requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a one-step vector field, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight and highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV first performs efficient time/channel mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to perform feature-wise nonlinear calibration of the action mapping on RWKV outputs. Moreover, we introduce an Action Consistency Regularization (ACR), a lightweight auxiliary loss that enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing additional supervision to stabilize training and improve policy precision. Without resorting to large UNets, our design reduces parameters by 86.8\%, maintains fast runtime, and achieves state-of-the-art success rates on Adroit, Meta-World, and DexArt benchmarks. Our project page can be viewed in \href{this https URL}{\textcolor{red}{link}}

169. 【2602.01101】Robust Harmful Meme Detection under Missing Modalities via Shared Representation Learning

链接https://arxiv.org/abs/2602.01101

作者:Felix Breiteneder,Mohammad Belal,Muhammad Saad Saeed,Shahed Masoudian,Usman Naseem,Kulshrestha Juhi,Markus Schedl,Shah Nawaz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Internet memes, tools for communication, capable of spreading, spreading political, sociocultural ideas

备注: Accepted at WWW2026

点击查看摘要

Abstract:Internet memes are powerful tools for communication, capable of spreading political, psychological, and sociocultural ideas. However, they can be harmful and can be used to disseminate hate toward targeted individuals or groups. Although previous studies have focused on designing new detection methods, these often rely on modal-complete data, such as text and images. In real-world settings, however, modalities like text may be missing due to issues like poor OCR quality, making existing methods sensitive to missing information and leading to performance deterioration. To address this gap, in this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of harmful meme detection methods in the presence of modal-incomplete data. Specifically, we propose a new baseline method that learns a shared representation for multiple modalities by projecting them independently. These shared representations can then be leveraged when data is modal-incomplete. Experimental results on two benchmark datasets demonstrate that our method outperforms existing approaches when text is missing. Moreover, these results suggest that our method allows for better integration of visual features, reducing dependence on text and improving robustness in scenarios where textual information is missing. Our work represents a significant step forward in enabling the real-world application of harmful meme detection, particularly in situations where a modality is absent.

170. 【2602.01095】PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space

链接https://arxiv.org/abs/2602.01095

作者:Jinghong Zheng,Changlong Jiang,Yang Xiao,Jiaqi Li,Haohong Kuang,Hang Xu,Ran Wang,Zhiguo Cao,Min Du,Joey Tianyi Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:single RGB image, single RGB, RGB image, human pose lifting, human pose

备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our proposition. The substantial reduction in error by $14.7\%$ compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.

171. 【2602.01089】Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models

链接https://arxiv.org/abs/2602.01089

作者:Zhiqi Zhang,Xinhao Zhong,Yi Sun,Shuoyang Sun,Bin Chen,Shu-Tao Xia,Xuan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:poses growing concerns, demonstrated remarkable capabilities, reproduce undesirable concepts, generating high-quality images, flow matching models

备注

点击查看摘要

Abstract:Text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images, yet their tendency to reproduce undesirable concepts, such as NSFW content, copyrighted styles, or specific objects, poses growing concerns for safe and controllable deployment. While existing concept erasure approaches primarily focus on DDPM-based diffusion models and rely on costly fine-tuning, the recent emergence of flow matching models introduces a fundamentally different generative paradigm for which prior methods are not directly applicable. In this paper, we propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. Leveraging this observation, we construct a differential vector field that characterizes the directional discrepancy between a target concept and a carefully chosen anchor concept. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics. Extensive experiments on FLUX demonstrate that DVE consistently outperforms existing baselines on a wide range of concept erasure tasks, including NSFW suppression, artistic style removal, and object erasure, while preserving image quality and diversity.

172. 【2602.01081】MedAD-R1: Eliciting Consistent Reasoning in Interpretible Medical Anomaly Detection via Consistency-Reinforced Policy Optimization

链接https://arxiv.org/abs/2602.01081

作者:Haitao Zhang,Yingying Wang,Jiaxiang Wang,Haote Xu,Hongyang Zhang,Yirong Chen,Yue Huang,Xinghao Ding

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Medical Anomaly Detection, Large Multimodal Models, Anomaly Detection, enhance diagnostic accuracy, accuracy using Large

备注: 9 pages, 4 figures

点击查看摘要

Abstract:Medical Anomaly Detection (MedAD) presents a significant opportunity to enhance diagnostic accuracy using Large Multimodal Models (LMMs) to interpret and answer questions based on medical images. However, the reliance on Supervised Fine-Tuning (SFT) on simplistic and fragmented datasets has hindered the development of models capable of plausible reasoning and robust multimodal generalization. To overcome this, we introduce MedAD-38K, the first large-scale, multi-modal, and multi-center benchmark for MedAD featuring diagnostic Chain-of-Thought (CoT) annotations alongside structured Visual Question-Answering (VQA) pairs. On this foundation, we propose a two-stage training framework. The first stage, Cognitive Injection, uses SFT to instill foundational medical knowledge and align the model with a structured think-then-answer paradigm. Given that standard policy optimization can produce reasoning that is disconnected from the final answer, the second stage incorporates Consistency Group Relative Policy Optimization (Con-GRPO). This novel algorithm incorporates a crucial consistency reward to ensure the generated reasoning process is relevant and logically coherent with the final diagnosis. Our proposed model, MedAD-R1, achieves state-of-the-art (SOTA) performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10\%. This superior performance stems from its ability to generate transparent and logically consistent reasoning pathways, offering a promising approach to enhancing the trustworthiness and interpretability of AI for clinical decision support.

173. 【2602.01077】PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

链接https://arxiv.org/abs/2602.01077

作者:Haopeng Li,Shitong Shao,Wenliang Zhong,Zikai Zhou,Lichen Bai,Hui Xiong,Zeke Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion Transformers, Transformers are fundamental, sparse attention, fundamental for video, efficiency is bottlenecked

备注: 17 pages

点击查看摘要

Abstract:Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value blocks, it suffers from degradation at high sparsity by discarding context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, which is essentially important for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drop the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: this https URL.

174. 【2602.01069】PDE-Constrained Optimization for Neural Image Segmentation with Physics Priors

链接https://arxiv.org/abs/2602.01069

作者:Seema K. Poudel,Sunny K. Khadka

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:weak object boundaries, ill-posed inverse problem, inverse problem due, limited labeled data, measurement noise

备注

点击查看摘要

Abstract:Segmentation of microscopy images constitutes an ill-posed inverse problem due to measurement noise, weak object boundaries, and limited labeled data. Although deep neural networks provide flexible nonparametric estimators, unconstrained empirical risk minimization often leads to unstable solutions and poor generalization. In this work, image segmentation is formulated as a PDE-constrained optimization problem that integrates physically motivated priors into deep learning models through variational regularization. The proposed framework minimizes a composite objective function consisting of a data fidelity term and penalty terms derived from reaction-diffusion equations and phase-field interface energies, all implemented as differentiable residual losses. Experiments are conducted on the LIVECell dataset, a high-quality, manually annotated collection of phase-contrast microscopy images. Training is performed on two cell types, while evaluation is carried out on a distinct, unseen cell type to assess generalization. A UNet architecture is used as the unconstrained baseline model. Experimental results demonstrate consistent improvements in segmentation accuracy and boundary fidelity compared to unconstrained deep learning baselines. Moreover, the PDE-regularized models exhibit enhanced stability and improved generalization in low-sample regimes, highlighting the advantages of incorporating structured priors. The proposed approach illustrates how PDE-constrained optimization can strengthen data-driven learning frameworks, providing a principled bridge between variational methods, statistical learning, and scientific machine learning.

175. 【2602.01059】DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification

链接https://arxiv.org/abs/2602.01059

作者:Ying Shu,Pujian Zhan,Huiqi Yang,Hehe Fan,Youfang Lin,Kai Lv

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:person re-identification challenges, fine-grained discriminative details, solving person re-identification, re-identification challenges, pose variations

备注

点击查看摘要

Abstract:Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges, such as occlusion and pose variations. Vision foundation models (\textit{e.g.}, DINO) excel at mining local textures, and vision-language models (\textit{e.g.}, CLIP) capture strong global semantic difference. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework to synergize their strengths by a \textbf{D}ual-\textbf{R}egularized Bidirectional \textbf{Transformer} (\textbf{DRFormer}). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.

176. 【2602.01057】Radioactive 3D Gaussian Ray Tracing for Tomographic Reconstruction

链接https://arxiv.org/abs/2602.01057

作者:Ling Chen,Bao Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:promising rendering technique, Elliptical Weighted Average, rendering technique, recently emerged, emerged in computer

备注

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has recently emerged in computer vision as a promising rendering technique. By adapting the principles of Elliptical Weighted Average (EWA) splatting to a modern differentiable pipeline, 3DGS enables real-time, high-quality novel view synthesis. Building upon this, R2-Gaussian extended the 3DGS paradigm to tomographic reconstruction by rectifying integration bias, achieving state-of-the-art performance in computed tomography (CT). To enable differentiability, R2-Gaussian adopts a local affine approximation: each 3D Gaussian is locally mapped to a 2D Gaussian on the detector and composed via alpha blending to form projections. However, the affine approximation can degrade reconstruction quantitative accuracy and complicate the incorporation of nonlinear geometric corrections. To address these limitations, we propose a tomographic reconstruction framework based on 3D Gaussian ray tracing. Our approach provides two key advantages over splatting-based models: (i) it computes the line integral through 3D Gaussian primitives analytically, avoiding the local affine collapse and thus yielding a more physically consistent forward projection model; and (ii) the ray-tracing formulation gives explicit control over ray origins and directions, which facilitates the precise application of nonlinear geometric corrections, e.g., arc-correction used in positron emission tomography (PET). These properties extend the applicability of Gaussian-based reconstruction to a wider range of realistic tomography systems while improving projection accuracy.

177. 【2602.01055】Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis

链接https://arxiv.org/abs/2602.01055

作者:Bo Deng,Yitong Tang,Jiake Li,Yuxin Huang,Li Wang,Yu Zhang,Yufei Zhan,Hua Lu,Xiaoshen Zhang,Jieyun Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:imaging exhibits substantial, exhibits substantial heterogeneity, posing significant challenges, generalizable analysis models, Ultrasound Image Analysis

备注

点击查看摘要

Abstract:Ultrasound (US) imaging exhibits substantial heterogeneity across anatomical structures and acquisition protocols, posing significant challenges to the development of generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models. To address this limitation, the Foundation Model Challenge for Ultrasound Image Analysis (FM\_UIA~2026) introduces a large-scale multi-task benchmark comprising 27 subtasks across segmentation, classification, detection, and regression. In this paper, we present the official baseline for FM\_UIA~2026 based on a unified Multi-Head Multi-Task Learning (MH-MTL) framework that supports all tasks within a single shared network. The model employs an ImageNet-pretrained EfficientNet--B4 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) to capture multi-scale contextual information. A task-specific routing strategy enables global tasks to leverage high-level semantic features, while dense prediction tasks exploit spatially detailed FPN representations. Training incorporates a composite loss with task-adaptive learning rate scaling and a cosine annealing schedule. Validation results demonstrate the feasibility and robustness of this unified design, establishing a strong and extensible baseline for ultrasound foundation model research. The code and dataset are publicly available at \href{this https URL}{GitHub}.

178. 【2602.01047】Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

链接https://arxiv.org/abs/2602.01047

作者:Xinrong Chen,Xu Chu,Yingmin Qiu,Hengyuan Zhang,Jing Xiong,Shiyu Tang,Shuai Liu,Shaokang Yang,Cheng Yang,Hayden Kwok-Hay So,Ngai Wong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, multimodal tasks, Large

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.

179. 【2602.01046】ReLayout: Versatile and Structure-Preserving Design Layout Editing via Relation-Aware Design Reconstruction

链接https://arxiv.org/abs/2602.01046

作者:Jiawei Lin,Shizhao Sun,Danqing Huang,Ting Liu,Ji Li,Jiang Bian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:manual adjustments marks, key step forward, Automated redesign, editing, manual adjustments

备注

点击查看摘要

Abstract:Automated redesign without manual adjustments marks a key step forward in the design workflow. In this work, we focus on a foundational redesign task termed design layout editing, which seeks to autonomously modify the geometric composition of a design based on user intents. To overcome the ambiguity of user needs expressed in natural language, we introduce four basic and important editing actions and standardize the format of editing operations. The underexplored task presents a unique challenge: satisfying specified editing operations while simultaneously preserving the layout structure of unedited elements. Besides, the scarcity of triplet (original design, editing operation, edited design) samples poses another formidable challenge. To this end, we present ReLayout, a novel framework for versatile and structure-preserving design layout editing that operates without triplet data. Specifically, ReLayout first introduces the relation graph, which contains the position and size relationships among unedited elements, as the constraint for layout structure preservation. Then, relation-aware design reconstruction (RADR) is proposed to bypass the data challenge. By learning to reconstruct a design from its elements, a relation graph, and a synthesized editing operation, RADR effectively emulates the editing process in a self-supervised manner. A multi-modal large language model serves as the backbone for RADR, unifying multiple editing actions within a single model and thus achieving versatile editing after fine-tuning. Qualitative, quantitative results and user studies show that ReLayout significantly outperforms the baseline models in terms of editing quality, accuracy, and layout structure preservation.

180. 【2602.01038】From Videos to Conversations: Egocentric Instructions for Task Assistance

链接https://arxiv.org/abs/2602.01038

作者:Lavisha Aggarwal,Vikas Bahirwani,Andrea Colaco

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:require expert knowledge, multi-step procedures, ranging from appliance, car maintenance, require expert

备注

点击查看摘要

Abstract:Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.

181. 【2602.01037】VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models

链接https://arxiv.org/abs/2602.01037

作者:Guangshuo Qin,Zhiteng Li,Zheng Chen,Weihang Zhang,Linghe Kong,Yulun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:making compression essential, offer remarkable performance, Vision-Language Models, incur prohibitive memory, offer remarkable

备注

点击查看摘要

Abstract:Mixture-of-Experts(MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1)Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2)Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04\% on Kimi-VL and 3.09\% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at this https URL.

182. 【2602.01035】FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence

链接https://arxiv.org/abs/2602.01035

作者:Chentian Sun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:point cloud, Real-time multi-view point, robotic navigation, digital twins, multi-view point cloud

备注: A 5-page paper, prepared for submission to the 2026 IEEE International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, fused via two weights, measurement confidence and 3D distance consistency to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.

183. 【2602.01033】GMAC: Global Multi-View Constraint for Automatic Multi-Camera Extrinsic Calibration

链接https://arxiv.org/abs/2602.01033

作者:Chentian Sun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multi-view data fusion, spatial extrinsic parameters, Automatic calibration, panoramic perception, data fusion

备注: A 5-page paper with 1 figure, prepared for submission to the 2026 IEEE International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:Automatic calibration of multi-camera systems, namely the accurate estimation of spatial extrinsic parameters, is fundamental for 3D reconstruction, panoramic perception, and multi-view data fusion. Existing methods typically rely on calibration targets, explicit geometric modeling, or task-specific neural networks. Such approaches often exhibit limited robustness and applicability in complex dynamic environments or online scenarios, making them difficult to deploy in practical applications. To address this, this paper proposes GMAC, a multi-camera extrinsic estimation framework based on the implicit geometric representations learned by multi-view reconstruction networks. GMAC models extrinsics as global variables constrained by the latent multi-view geometric structure and prunes and structurally reconfigures existing networks so that their latent features can directly support extrinsic prediction through a lightweight regression head, without requiring a completely new network design. Furthermore, GMAC jointly optimizes cross-view reprojection consistency and multi-view cycle consistency, ensuring geometric coherence across cameras while improving prediction accuracy and optimization stability. Experiments on both synthetic and real-world multi-camera datasets demonstrate that GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration, providing a new solution for efficient deployment and online calibration of multi-camera systems.

184. 【2602.01025】oward Universal and Transferable Jailbreak Attacks on Vision-Language Models

链接https://arxiv.org/abs/2602.01025

作者:Kaiyuan Cui,Yige Li,Yutao Wu,Xingjun Ma,Sarah Erfani,Christopher Leckie,Hanxun Huang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:text generation conditioned, extend large language, large language models, enabling text generation, Vision-language models

备注: ICLR 2026

点击查看摘要

Abstract:Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \href{this https URL}{GitHub repository}.

185. 【2602.01020】Effectiveness of Automatically Curated Dataset in Thyroid Nodules Classification Algorithms Using Deep Learning

链接https://arxiv.org/abs/2602.01020

作者:Jichen Yang,Jikai Zhang,Benjamin Wildman-Tobriner,Maciej A. Mazurowski

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep learning, utilizes ultrasound images, deep learning models, deep learning model, cancers commonly utilizes

备注: 9 pages, 3 figures

点击查看摘要

Abstract:The diagnosis of thyroid nodule cancers commonly utilizes ultrasound images. Several studies showed that deep learning algorithms designed to classify benign and malignant thyroid nodules could match radiologists' performance. However, data availability for training deep learning models is often limited due to the significant effort required to curate such datasets. The previous study proposed a method to curate thyroid nodule datasets automatically. It was tested to have a 63% yield rate and 83% accuracy. However, the usefulness of the generated data for training deep learning models remains unknown. In this study, we conducted experiments to determine whether using a automatically-curated dataset improves deep learning algorithms' performance. We trained deep learning models on the manually annotated and automatically-curated datasets. We also trained with a smaller subset of the automatically-curated dataset that has higher accuracy to explore the optimum usage of such dataset. As a result, the deep learning model trained on the manually selected dataset has an AUC of 0.643 (95% confidence interval [CI]: 0.62, 0.66). It is significantly lower than the AUC of the 6automatically-curated dataset trained deep learning model, 0.694 (95% confidence interval [CI]: 0.67, 0.73, P .001). The AUC of the accurate subset trained deep learning model is 0.689 (95% confidence interval [CI]: 0.66, 0.72, P .43), which is insignificantly worse than the AUC of the full automatically-curated dataset. In conclusion, we showed that using a automatically-curated dataset can substantially increase the performance of deep learning algorithms, and it is suggested to use all the data rather than only using the accurate subset.

186. 【2602.01012】LocalScore: Local Density-Aware Similarity Scoring for Biometrics

链接https://arxiv.org/abs/2602.01012

作者:Yiyang Su,Minchul Kim,Jie Zhu,Christopher Perry,Feng Liu,Anil Jain,Xiaoming Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:biometrics faces challenges, Open-set biometrics faces, probe subjects, non-mated probes, traditional biometric systems

备注

点击查看摘要

Abstract:Open-set biometrics faces challenges with probe subjects who may not be enrolled in the gallery, as traditional biometric systems struggle to detect these non-mated probes. Despite the growing prevalence of multi-sample galleries in real-world deployments, most existing methods collapse intra-subject variability into a single global representation, leading to suboptimal decision boundaries and poor open-set robustness. To address this issue, we propose LocalScore, a simple yet effective scoring algorithm that explicitly incorporates the local density of the gallery feature distribution using the k-th nearest neighbors. LocalScore is architecture-agnostic, loss-independent, and incurs negligible computational overhead, making it a plug-and-play solution for existing biometric systems. Extensive experiments across multiple modalities demonstrate that LocalScore consistently achieves substantial gains in open-set retrieval (FNIR@FPIR reduced from 53% to 40%) and verification (TAR@FAR improved from 51% to 74%). We further provide theoretical analysis and empirical validation explaining when and why the method achieves the most significant gains based on dataset characteristics.

187. 【2602.01004】SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning

链接https://arxiv.org/abs/2602.01004

作者:Zihao Zhao,Shengting Cao,Muchao Ye

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, shown promising effectiveness, video anomaly understanding, Multi-modal large language, demonstrated significant progress

备注

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deep reasoning over abnormal behaviors like explicit self-reflection and self-correction. To address that, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection in MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Based on that, it includes a novel reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.

188. 【2602.01000】CortiNet: A Physics-Perception Hybrid Cortical-Inspired Dual-Stream Network for Gallbladder Disease Diagnosis from Ultrasound

链接https://arxiv.org/abs/2602.01000

作者:Vagish Kumar,Souvik Chakraborty

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:primary diagnostic modality, Gallbladder diseases due, detecting Gallbladder diseases, non-invasive nature, wide accessibility

备注

点击查看摘要

Abstract:Ultrasound imaging is the primary diagnostic modality for detecting Gallbladder diseases due to its non-invasive nature, affordability, and wide accessibility. However, the low resolution and speckle noise inherent to ultrasound images hinder diagnostic reliability, prompting the use of large convolutional neural networks that are difficult to deploy in routine clinical settings. In this work, we propose CortiNet, a lightweight, cortical-inspired dual-stream neural architecture for gallbladder disease diagnosis that integrates physically interpretable multi-scale signal decomposition with perception-driven feature learning. Inspired by parallel processing pathways in the human visual cortex, CortiNet explicitly separates low-frequency structural information from high-frequency perceptual details and processes them through specialized encoding streams. By operating directly on structured, frequency-selective representations rather than raw pixel intensities, the architecture embeds strong physics-based inductive bias, enabling efficient feature learning with a significantly reduced parameter footprint. A late-stage cortical-style fusion mechanism integrates complementary structural and textural cues while preserving computational efficiency. Additionally, we propose a structure-aware explainability framework wherein gradient-weighted class activation mapping is only applied to the structural branch of the proposed CortiNet architecture. This choice allows the model to only focus on the structural features, making it robust against speckle noise. We evaluate CortiNet on 10,692 expert-annotated images spanning nine clinically relevant gallbladder disease categories. Experimental results demonstrate that CortiNet achieves high diagnostic accuracy (98.74%) with only a fraction of the parameters required by conventional deep convolutional models.

189. 【2602.00995】VAMOS-OCTA: Vessel-Aware Multi-Axis Orthogonal Supervision for Inpainting Motion-Corrupted OCT Angiography Volumes

链接https://arxiv.org/abs/2602.00995

作者:Nick DiSanto,Ehsan Khodapanah Aghdam,Han Liu,Jacob Watson,Yuankai K. Tao,Hao Li,Ipek Oguz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Handheld Optical Coherence, Coherence Tomography Angiography, Optical Coherence Tomography, Handheld Optical, Tomography Angiography

备注: Accepted to SPIE Medical Imaging 2026

点击查看摘要

Abstract:Handheld Optical Coherence Tomography Angiography (OCTA) enables noninvasive retinal imaging in uncooperative or pediatric subjects, but is highly susceptible to motion artifacts that severely degrade volumetric image quality. Sudden motion during 3D acquisition can lead to unsampled retinal regions across entire B-scans (cross-sectional slices), resulting in blank bands in en face projections. We propose VAMOS-OCTA, a deep learning framework for inpainting motion-corrupted B-scans using vessel-aware multi-axis supervision. We employ a 2.5D U-Net architecture that takes a stack of neighboring B-scans as input to reconstruct a corrupted center B-scan, guided by a novel Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss. This loss combines vessel-weighted intensity reconstruction with axial and lateral projection consistency, encouraging vascular continuity in native B-scans and across orthogonal planes. Unlike prior work that focuses primarily on restoring the en face MIP, VAMOS-OCTA jointly enhances both cross-sectional B-scan sharpness and volumetric projection accuracy, even under severe motion corruptions. We trained our model on both synthetic and real-world corrupted volumes and evaluated its performance using both perceptual quality and pixel-wise accuracy metrics. VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections. These results demonstrate that multi-axis supervision offers a powerful constraint for restoring motion-degraded 3D OCTA data. Our source code is available at this https URL.

190. 【2602.00982】Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025

链接https://arxiv.org/abs/2602.00982

作者:Phu-Hoa Pham,Chi-Nguyen Tran,Dao Sy Duy Minh,Nguyen Lam Phu Quy,Huynh Trung Kiet

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)

关键词:biological vision systems, match biological vision, Visual Foraging Competition, alignment remain critical, remain critical challenges

备注: 15 pages, 8 tables. Technical Report for winning solutions (Track 1 Track 2) at the NeurIPS 2025 Mouse vs. AI Challenge

点击查看摘要

Abstract:Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained between 60K to 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically-inspired visual agents.

191. 【2602.00971】Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

链接https://arxiv.org/abs/2602.00971

作者:Meng Luo,Bobo Li,Shanqing Xu,Shize Zhang,Qiuchan Chen,Menglu Han,Wenhao Chen,Yanxiang Huang,Hao Fei,Mong-Li Lee,Wynne Hsu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal large language, understanding remains limited, large language models, remains limited, rapid progress

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: this https URL.

192. 【2602.00956】Hybrid Topological and Deep Feature Fusion for Accurate MRI-Based Alzheimer's Disease Severity Classification

链接https://arxiv.org/abs/2602.00956

作者:Faisal Ahmed

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:decision support systems, neuroimaging-based clinical decision, clinical decision support, Alzheimer disease, Alzheimer disease classification

备注: 20 pages, 6 Figures

点击查看摘要

Abstract:Early and accurate diagnosis of Alzheimer's disease (AD) remains a critical challenge in neuroimaging-based clinical decision support systems. In this work, we propose a novel hybrid deep learning framework that integrates Topological Data Analysis (TDA) with a DenseNet121 backbone for four-class Alzheimer's disease classification using structural MRI data from the OASIS dataset. TDA is employed to capture complementary topological characteristics of brain structures that are often overlooked by conventional neural networks, while DenseNet121 efficiently learns hierarchical spatial features from MRI slices. The extracted deep and topological features are fused to enhance class separability across the four AD stages. Extensive experiments conducted on the OASIS-1 Kaggle MRI dataset demonstrate that the proposed TDA+DenseNet121 model significantly outperforms existing state-of-the-art approaches. The model achieves an accuracy of 99.93% and an AUC of 100%, surpassing recently published CNN-based, transfer learning, ensemble, and multi-scale architectures. These results confirm the effectiveness of incorporating topological insights into deep learning pipelines and highlight the potential of the proposed framework as a robust and highly accurate tool for automated Alzheimer's disease diagnosis.

Comments:
20 pages, 6 Figures

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2602.00956 [cs.CV]

(or
arXiv:2602.00956v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.00956

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
193. 【2602.00949】Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images

链接https://arxiv.org/abs/2602.00949

作者:Xiang Zhang,Boxuan Zhang,Alireza Naghizadeh,Mohab Mohamed,Dongfang Liu,Ruixiang Tang,Dimitris Metaxas,Dongfang Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Chimeric antigen receptor, cell immunological synapse, transformed cancer treatment, recent studies suggest, Chimeric antigen

备注

点击查看摘要

Abstract:Chimeric antigen receptor (CAR)-T and NK cell immunotherapies have transformed cancer treatment, and recent studies suggest that the quality of the CAR-T/NK cell immunological synapse (IS) may serve as a functional biomarker for predicting therapeutic efficacy. Accurate detection and segmentation of CAR-T/NK IS structures using artificial neural networks (ANNs) can greatly increase the speed and reliability of IS quantification. However, a persistent challenge is the limited size of annotated microscopy datasets, which restricts the ability of ANNs to generalize. To address this challenge, we integrate two complementary data-augmentation frameworks. First, we employ Instance Aware Automatic Augmentation (IAAA), an automated, instance-preserving augmentation method that generates synthetic CAR-T/NK IS images and corresponding segmentation masks by applying optimized augmentation policies to original IS data. IAAA supports multiple imaging modalities (e.g., fluorescence and brightfield) and can be applied directly to CAR-T/NK IS images derived from patient samples. In parallel, we introduce a Semantic-Aware AI Augmentation (SAAA) pipeline that combines a diffusion-based mask generator with a Pix2Pix conditional image synthesizer. This second method enables the creation of diverse, anatomically realistic segmentation masks and produces high-fidelity CAR-T/NK IS images aligned with those masks, further expanding the training corpus beyond what IAAA alone can provide. Together, these augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance. By enhancing the robustness and accuracy of IS quantification, this work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.

194. 【2602.00946】ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models

链接https://arxiv.org/abs/2602.00946

作者:Dhruv Parikh,Haoyang Fan,Rajgopal Kannan,Viktor Prasanna

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:LLM processes hundreds, Vision-Language Models, largely redundant visual, LLM processes, processes hundreds

备注: Technical Report

点击查看摘要

Abstract:Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit \textit{either} vision-encoder saliency (broad but query-agnostic) \textit{or} LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available \emph{inside} the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose \textbf{ConsensusDrop}, a training-free framework that derives a \emph{consensus} ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier -- preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.

195. 【2602.00937】CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

链接https://arxiv.org/abs/2602.00937

作者:I-Chun Arthur Liu,Krzysztof Choromanski,Sandy Huang,Connor Schenck

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:behavior cloning policies, achieved great success, Robotic Manipulation Pretraining, Leveraging pre-trained, robotic manipulation

备注

点击查看摘要

Abstract:Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.

196. 【2602.00904】OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection

链接https://arxiv.org/abs/2602.00904

作者:Kunal Mahatha,Ali Bahri,Pierre Marza,Sahar Dastani,Maria Vakalopoulou,Stergios Christodoulidis,Jose Dolz,Christian Desrosiers

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:State space models, State space, modeling global relationships, recently emerged, alternative to transformers

备注

点击查看摘要

Abstract:State space models (SSMs) have recently emerged as an alternative to transformers due to their unique ability of modeling global relationships in text with linear complexity. However, their success in vision tasks has been limited due to their causal formulation, which is suitable for sequential text but detrimental in the spatial domain where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS , a novel architecture that preserves both global context and local spatial structure within images, while maintaining the linear complexity of SSMs. OCTOPUS performs discrete reoccurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while maintaining relatively better classification accuracy compared to existing V-SSM based models. These results suggest that OCTOPUS appears as a foundation method for multi-directional recurrence as a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.

197. 【2602.00883】DIAMOND: Directed Inference for Artifact Mitigation in Flow Matching Models

链接https://arxiv.org/abs/2602.00883

作者:Alicja Polowczyk,Agnieszka Polowczyk,Piotr Borycki,Joanna Waczyńska,Jacek Tabor,Przemysław Spurek

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:anatomical artifacts remain, results from recent, visual and anatomical, impressive results, remain a significant

备注

点击查看摘要

Abstract:Despite impressive results from recent text-to-image models like FLUX, visual and anatomical artifacts remain a significant hurdle for practical and professional use. Existing methods for artifact reduction, typically work in a post-hoc manner, consequently failing to intervene effectively during the core image formation process. Notably, current techniques require problematic and invasive modifications to the model weights, or depend on a computationally expensive and time-consuming process of regional refinement. To address these limitations, we propose DIAMOND, a training-free method that applies trajectory correction to mitigate artifacts during inference. By reconstructing an estimate of the clean sample at every step of the generative trajectory, DIAMOND actively steers the generation process away from latent states that lead to artifacts. Furthermore, we extend the proposed method to standard Diffusion Models, demonstrating that DIAMOND provides a robust, zero-shot path to high-fidelity, artifact-free image synthesis without the need for additional training or weight modifications in modern generative architectures. Code is available at this https URL

198. 【2602.00865】Distill3R: A Pipeline for Democratizing 3D Foundation Models on Commodity Hardware

链接https://arxiv.org/abs/2602.00865

作者:Brandon Leblanc,Charalambos Poullis

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:globally consistent geometry, inferring globally consistent, foundation models capable, reconstruction has shifted, consistent geometry

备注: Submitted to the Canadian Conference on Robotics and Vision (CRV). 10 pages, 5 figures

点击查看摘要

Abstract:While multi-view 3D reconstruction has shifted toward large-scale foundation models capable of inferring globally consistent geometry, their reliance on massive computational clusters for training has created a significant barrier to entry for most academic laboratories. To bridge this compute divide, we introduce Distill3R, a framework designed to distill the geometric reasoning of 3D foundation models into compact students fully trainable on a single workstation. Our methodology centers on two primary innovations: (1) an offline caching pipeline that decouples heavy teacher inference from the training loop through compressed supervision signals, and (2) a confidence-aware distillation loss that leverages teacher uncertainty to enable training on commodity hardware. We propose a 72M-parameter student model which achieves a 9x reduction in parameters and a 5x inference speedup compared to its 650M-parameter teacher. The student is fully trainable in under 3 days on a single workstation, whereas its teacher requires massive GPU clusters for up to a week. We demonstrate that the student preserves the structural consistency and qualitative geometric understanding required for functional 3D awareness. By providing a reproducible, single-workstation training recipe, Distill3R serves as an exploratory entry point for democratized 3D vision research and efficient edge deployment. This work is not intended to compete with state-of-the-art foundation models, but to provide an accessible research baseline for laboratories without access to large-scale compute to train and specialize models on their own domain-specific data at minimal cost.

199. 【2602.00841】Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition

链接https://arxiv.org/abs/2602.00841

作者:Jintao Cheng,Weibin Li,Zhijian He,Jin Wu,Chi Man Vong,Wei Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Place Recognition, Visual Place, Place Recognition, demands representations robust, demands representations

备注: 14pages, 5 figures

点击查看摘要

Abstract:Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.

200. 【2602.00839】ransNormal: Dense Visual Semantics for Diffusion-based Transparent Object Normal Estimation

链接https://arxiv.org/abs/2602.00839

作者:Mingwei Li,Hehe Fan,Yi Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Monocular normal estimation, remains challenging due, complex light refraction, laboratory automation, refraction and reflection

备注: Project Page: [this https URL](https://longxiang-ai.github.io/TransNormal)

点击查看摘要

Abstract:Monocular normal estimation for transparent objects is critical for laboratory automation, yet it remains challenging due to complex light refraction and reflection. These optical properties often lead to catastrophic failures in conventional depth and normal sensors, hindering the deployment of embodied AI in scientific environments. We propose TransNormal, a novel framework that adapts pre-trained diffusion priors for single-step normal regression. To handle the lack of texture in transparent surfaces, TransNormal integrates dense visual semantics from DINOv3 via a cross-attention mechanism, providing strong geometric cues. Furthermore, we employ a multi-task learning objective and wavelet-based regularization to ensure the preservation of fine-grained structural details. To support this task, we introduce TransNormal-Synthetic, a physics-based dataset with high-fidelity normal maps for transparent labware. Extensive experiments demonstrate that TransNormal significantly outperforms state-of-the-art methods: on the ClearGrasp benchmark, it reduces mean error by 24.4% and improves 11.25° accuracy by 22.8%; on ClearPose, it achieves a 15.2% reduction in mean error. The code and dataset will be made publicly available at this https URL.

201. 【2602.00821】Edge-Native Generative De-identification: Inversion-Free Flow for Privacy-Preserving Federated Skin Image Analysis

链接https://arxiv.org/abs/2602.00821

作者:Konstantinos Moutselos,Ilias Maglogiannis

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:preserving diagnostic features, protecting patient privacy, diagnostic features, Federated Learning, dermatology is hindered

备注: 8 pages, 5 figures

点击查看摘要

Abstract:The deployment of Federated Learning (FL) for clinical dermatology is hindered by the competing requirements of protecting patient privacy and preserving diagnostic features. Traditional de-identification methods often degrade pathological fidelity, while standard generative editing techniques rely on computationally intensive inversion processes unsuitable for resource-constrained edge devices. We propose a framework for identity-agnostic pathology preservation that serves as a client-side privacy-preserving utility. By leveraging inversion-free Rectified Flow Transformers (FlowEdit), the system performs high-fidelity identity transformation in near real-time (less than 20s), facilitating local deployment on clinical nodes. We introduce a "Segment-by-Synthesis" mechanism that generates counterfactual healthy and pathological twin pairs locally. This enables the extraction of differential erythema masks that are decoupled from biometric markers and semantic artifacts (e.g. jewelry). Pilot validation on high-resolution clinical samples demonstrates an Intersection over Union (IoU) stability greater than 0.67 across synthetic identities. By generating privacy-compliant synthetic surrogates at the edge, this framework mitigates the risk of gradient leakage at the source, providing a secure pathway for high-precision skin image analysis in federated environments.

202. 【2602.00814】SyNeT: Synthetic Negatives for Traversability Learning

链接https://arxiv.org/abs/2602.00814

作者:Bomena Kim,Hojun Lee,Younsoo Park,Yaoyu Hu,Sebastian Scherer,Inwook Shim

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Reliable traversability estimation, navigate complex outdoor, outdoor environments safely, complex outdoor environments, Reliable traversability

备注

点击查看摘要

Abstract:Reliable traversability estimation is crucial for autonomous robots to navigate complex outdoor environments safely. Existing self-supervised learning frameworks primarily rely on positive and unlabeled data; however, the lack of explicit negative data remains a critical limitation, hindering the model's ability to accurately identify diverse non-traversable regions. To address this issue, we introduce a method to explicitly construct synthetic negatives, representing plausible but non-traversable, and integrate them into vision-based traversability learning. Our approach is formulated as a training strategy that can be seamlessly integrated into both Positive-Unlabeled (PU) and Positive-Negative (PN) frameworks without modifying inference architectures. Complementing standard pixel-wise metrics, we introduce an object-centric FPR evaluation approach that analyzes predictions in regions where synthetic negatives are inserted. This evaluation provides an indirect measure of the model's ability to consistently identify non-traversable regions without additional manual labeling. Extensive experiments on both public and self-collected datasets demonstrate that our approach significantly enhances robustness and generalization across diverse environments. The source code and demonstration videos will be publicly available.

203. 【2602.00813】Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

链接https://arxiv.org/abs/2602.00813

作者:Tong Wang,Yunhan Zhao,Shu Kong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Composed Image Retrieval, mental image, Image, Composed Image, Image Retrieval

备注

点击查看摘要

Abstract:Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a ``mental image'', based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this ``mental image'' is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the ``mental image'' for more accurate matching. Particularly, we prompt an LMM to generate a ``mental image'' for a given multimodal query and propose to use this ``mental image'' to search for the target image. As the ``mental image'' has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a ``paracosm'', where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.

204. 【2602.00810】VVLoc: Prior-free 3-DoF Vehicle Visual Localization

链接https://arxiv.org/abs/2602.00810

作者:Ze Huang,Zhongyang Xiao,Mingliang Song,Longan Yang,Hongyuan Yuan,Li Sun

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:precise spatial coordinates, similar map keyframe, autonomous driving, spatial coordinates, critical technology

备注

点击查看摘要

Abstract:Localization is a critical technology in autonomous driving, encompassing both topological localization, which identifies the most similar map keyframe to the current observation, and metric localization, which provides precise spatial coordinates. Conventional methods typically address these tasks independently, rely on single-camera setups, and often require additional 3D semantic or pose priors, while lacking mechanisms to quantify the confidence of localization results, making them less feasible for real industrial applications. In this paper, we propose VVLoc, a unified pipeline that employs a single neural network to concurrently achieve topological and metric vehicle localization using multi-camera system. VVLoc first evaluates the geo-proximity between visual observations, then estimates their relative metric poses using a matching strategy, while also providing a confidence measure. Additionally, the training process for VVLoc is highly efficient, requiring only pairs of visual data and corresponding ground-truth poses, eliminating the need for complex supplementary data. We evaluate VVLoc not only on the publicly available datasets, but also on a more challenging self-collected dataset, demonstrating its ability to deliver state-of-the-art localization accuracy across a wide range of localization tasks.

205. 【2602.00807】Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds

链接https://arxiv.org/abs/2602.00807

作者:Xianzhe Fan,Shengliang Deng,Xiaoyang Wu,Yuxiang Lu,Zhuoling Li,Mi Yan,Yujia Zhang,Zhizheng Zhang,He Wang,Hengshuang Zhao

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:models typically, complex scenes, enhance VLA capabilities, limits their spatial, spatial understanding

备注

点击查看摘要

Abstract:Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our project homepage is available at this https URL.

206. 【2602.00795】DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

链接https://arxiv.org/abs/2602.00795

作者:Wenhao Li,Xianjing Meng,Qiangchang Wang,Zhongyi Han,Zhibin Wu,Yilong Yin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Few-shot learning, aims to generalize, Dual-level Semantic Construction, Few-shot, Reinforcement Learning gating

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.

207. 【2602.00763】Evaluating Deep Learning-Based Nerve Segmentation in Brachial Plexus Ultrasound Under Realistic Data Constraints

链接https://arxiv.org/abs/2602.00763

作者:Dylan Yves,Khush Agarwal,Jonathan Hoyin Chan,Patcharapit Promoppatum,Aroonkamon Pattanasiricharoen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:ultrasound-guided regional anesthesia, inter-patient anatomical variability, Accurate nerve localization, low image contrast, manual identification remains

备注: 9 pages, 6 figures

点击查看摘要

Abstract:Accurate nerve localization is critical for the success of ultrasound-guided regional anesthesia, yet manual identification remains challenging due to low image contrast, speckle noise, and inter-patient anatomical variability. This study evaluates deep learning-based nerve segmentation in ultrasound images of the brachial plexus using a U-Net architecture, with a focus on how dataset composition and annotation strategy influence segmentation performance. We find that training on combined data from multiple ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) provides regularization benefits for lower-performing acquisition sources, though it does not surpass single-source training when matched to the target domain. Extending the task from binary nerve segmentation to multi-class supervision (artery, vein, nerve, muscle) results in decreased nerve-specific Dice scores, with performance drops ranging from 9% to 61% depending on dataset, likely due to class imbalance and boundary ambiguity. Additionally, we observe a moderate positive correlation between nerve size and segmentation accuracy (Pearson r=0.587, p0.001), indicating that smaller nerves remain a primary challenge. These findings provide methodological guidance for developing robust ultrasound nerve segmentation systems under realistic clinical data constraints.

208. 【2602.00749】HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression

链接https://arxiv.org/abs/2602.00749

作者:Xiangming Wang,Benteng Sun,Yungeng Liu,Haijin Zeng,Yongyong Chen,Jingyong Su,Jie Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Hyperspectral images, information beyond RGB, capture richer spatial-spectral, richer spatial-spectral information, HSI restoration

备注

点击查看摘要

Abstract:Hyperspectral images (HSIs) capture richer spatial-spectral information beyond RGB, yet real-world HSIs often suffer from a composite mix of degradations, such as noise, blur, and missing bands. Existing generative approaches for HSI restoration like diffusion models require hundreds of iterative steps, making them computationally impractical for high-dimensional HSIs. While regression models tend to produce oversmoothed results, failing to preserve critical structural details. We break this impasse by introducing HSI-VAR, rethinking HSI restoration as an autoregressive generation problem, where spectral and spatial dependencies can be progressively modeled rather than globally reconstructed. HSI-VAR incorporates three key innovations: (1) Latent-condition alignment, which couples semantic consistency between latent priors and conditional embeddings for precise reconstruction; (2) Degradation-aware guidance, which uniquely encodes mixed degradations as linear combinations in the embedding space for automatic control, remarkably achieving a nearly $50\%$ reduction in computational cost at inference; (3) A spatial-spectral adaptation module that refines details across both domains in the decoding phase. Extensive experiments on nine all-in-one HSI restoration benchmarks confirm HSI-VAR's state-of-the-art performance, achieving a 3.77 dB PSNR improvement on \textbf{\textit{ICVL}} and offering superior structure preservation with an inference speed-up of up to $95.5 \times$ compared with diffusion-based methods, making it a highly practical solution for real-world HSI restoration.

209. 【2602.00746】Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

链接https://arxiv.org/abs/2602.00746

作者:Jianping Zhong,Guochang Li,Chen Zhi,Junxiao Han,Zhen Qin,Xinkui Zhao,Nan Wang,Shuiguang Deng,Jianwei Yin

类目:oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Large Language, Language Models, long-context code due, struggle with long-context

备注

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios ($\sim$1.7$\times$), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4$\times$ higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from $\sim$4.3 hours to $\sim$1 minute at 1M tokens). Finally, our results characterize a fundamental coverage--fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.

Subjects:

Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.00746 [cs.SE]

(or
arXiv:2602.00746v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2602.00746

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
210. 【2602.00739】Diffusion-Driven Inter-Outer Surface Separation for Point Clouds with Open Boundaries

链接https://arxiv.org/abs/2602.00739

作者:Zhengyan Qin,Liyuan Qiu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Signed Distance Function, Truncated Signed Distance, Distance Function, Truncated Signed, Signed Distance

备注

点击查看摘要

Abstract:We propose a diffusion-based algorithm for separating the inter and outer layer surfaces from double-layered point clouds, particularly those exhibiting the "double surface artifact" caused by truncation in Truncated Signed Distance Function (TSDF) fusion during indoor or medical 3D reconstruction. This artifact arises from asymmetric truncation thresholds, leading to erroneous inter and outer shells in the fused volume, which our method addresses by extracting the true inter layer to mitigate challenges like overlapping surfaces and disordered normals. We focus on point clouds with \emph{open boundaries} (i.e., sampled surfaces with topological openings/holes through which particles may escape), rather than point clouds with \emph{missing surface regions} where no samples exist. Our approach enables robust processing of both watertight and open-boundary models, achieving extraction of the inter layer from 20,000 inter and 20,000 outer points in approximately 10 seconds. This solution is particularly effective for applications requiring accurate surface representations, such as indoor scene modeling and medical imaging, where double-layered point clouds are prevalent, and it accommodates both closed (watertight) and open-boundary surface geometries. Our goal is \emph{post-hoc} inter/outer shell separation as a lightweight module after TSDF fusion; we do not aim to replace full variational or learning-based reconstruction pipelines.

211. 【2602.00729】Supervised makeup transfer with a curated dataset: Decoupling identity and makeup features for enhanced transformation

链接https://arxiv.org/abs/2602.00729

作者:Qihe Pan,Yiming Wu,Xing Zhao,Liang Xie,Guodao Sun,Ronghua Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently shown strong, shown strong progress, Diffusion models, generative tasks, models have recently

备注: This paper has been accepted for publication in the proceedings of 2026 IEEE ICASSP Conference

点击查看摘要

Abstract:Diffusion models have recently shown strong progress in generative tasks, offering a more stable alternative to GAN-based approaches for makeup transfer. Existing methods often suffer from limited datasets, poor disentanglement between identity and makeup features, and weak controllability. To address these issues, we make three contributions. First, we construct a curated high-quality dataset using a train-generate-filter-retrain strategy that combines synthetic, realistic, and filtered samples to improve diversity and fidelity. Second, we design a diffusion-based framework that disentangles identity and makeup features, ensuring facial structure and skin tone are preserved while applying accurate and diverse cosmetic styles. Third, we propose a text-guided mechanism that allows fine-grained and region-specific control, enabling users to modify eyes, lips, or face makeup with natural language prompts. Experiments on benchmarks and real-world scenarios demonstrate improvements in fidelity, identity preservation, and flexibility. Examples of our dataset can be found at: this https URL.

212. 【2602.00703】StomataSeg: Semi-Supervised Instance Segmentation for Sorghum Stomatal Components

链接https://arxiv.org/abs/2602.00703

作者:Zhongtian Huang,Zhi Chen,Zi Huang,Xin Yu,Daniel Smith,Chaitanya Purushothama,Erik Van Oosterom,Alex Wu,William Salter,Yan Li,Scott Chapman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:globally important cereal, important cereal grown, cereal grown widely, stress-prone regions, globally important

备注

点击查看摘要

Abstract:Sorghum is a globally important cereal grown widely in water-limited and stress-prone regions. Its strong drought tolerance makes it a priority crop for climate-resilient agriculture. Improving water-use efficiency in sorghum requires precise characterisation of stomatal traits, as stomata control of gas exchange, transpiration and photosynthesis have a major influence on crop performance. Automated analysis of sorghum stomata is difficult because the stomata are small (often less than 40 $\mu$m in length in grasses such as sorghum) and vary in shape across genotypes and leaf surfaces. Automated segmentation contributes to high-throughput stomatal phenotyping, yet current methods still face challenges related to nested small structures and annotation bottlenecks. In this paper, we propose a semi-supervised instance segmentation framework tailored for analysis of sorghum stomatal components. We collect and annotate a sorghum leaf imagery dataset containing 11,060 human-annotated patches, covering the three stomatal components (pore, guard cell and complex area) across multiple genotypes and leaf surfaces. To improve the detection of tiny structures, we split high-resolution microscopy images into overlapping small patches. We then apply a pseudo-labelling strategy to unannotated images, producing an additional 56,428 pseudo-labelled patches. Benchmarking across semantic and instance segmentation models shows substantial performance gains: for semantic models the top mIoU increases from 65.93% to 70.35%, whereas for instance models the top AP rises from 28.30% to 46.10%. These results demonstrate that combining patch-based preprocessing with semi-supervised learning significantly improves the segmentation of fine stomatal structures. The proposed framework supports scalable extraction of stomatal traits and facilitates broader adoption of AI-driven phenotyping in crop science.

213. 【2602.00702】JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

链接https://arxiv.org/abs/2602.00702

作者:Ruikui Wang,Jinheng Feng,Lang Tian,Huaishao Luo,Chaochao Li,Liangbo Zhou,Huan Zhang,Youzheng Wu,Xiaodong He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing video avatar, demonstrated impressive capabilities, public speaking, Existing video, demonstrated impressive

备注

点击查看摘要

Abstract:Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To break out this limitation, we present JoyAvatar, a framework capable of generating long duration avatar videos, featuring two key technical innovations. Firstly, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Secondly, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the distinct denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two key designs serve to substantially expand the avatar model's capacity to generate natural, temporally coherent full-body motions and dynamic camera movements as well as preserve the basic avatar capabilities, such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms the state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and non-human subjects role-playing. Some video samples are provided on this https URL.

214. 【2602.00701】Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning

链接https://arxiv.org/abs/2602.00701

作者:Mohamed Saleh,Zahra Ahmadi

类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)

关键词:Effective multimodal fusion, complex cross-modal dependencies, remaining computationally scalable, Effective multimodal, fusion

备注

点击查看摘要

Abstract:Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building upon CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs through progressively decreasing spatial resolutions and increasing semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves remarkable energy efficiency while maintaining fusion effectiveness and establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable fusion mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.

215. 【2602.00687】V2X-DSC: Multi-Agent Collaborative Perception with Distributed Source Coding Guided Communication

链接https://arxiv.org/abs/2602.00687

作者:Yuankun Zeng,Shaohui Li,Zhi Li,Shulan Ruan,Yu Liu,You He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Collaborative perception improves, fusing multi-agent observations, intermediate-feature sharing faces, sharing faces strict, faces strict bandwidth

备注

点击查看摘要

Abstract:Collaborative perception improves 3D understanding by fusing multi-agent observations, yet intermediate-feature sharing faces strict bandwidth constraints as dense BEV features saturate V2X links. We observe that collaborators view the same physical world, making their features strongly correlated; thus receivers only need innovation beyond their local context. Revisiting this from a distributed source coding perspective, we propose V2X-DSC, a framework with a Conditional Codec (DCC) for bandwidth-constrained fusion. The sender compresses BEV features into compact codes, while the receiver performs conditional reconstruction using its local features as side information, allocating bits to complementary cues rather than redundant content. This conditional structure regularizes learning, encouraging incremental representation and yielding lower-noise features. Experiments on DAIR-V2X, OPV2V, and V2X-Real demonstrate state-of-the-art accuracy-bandwidth trade-offs under KB-level communication, and generalizes as a plug-and-play communication layer across multiple fusion backbones.

216. 【2602.00683】Video Understanding: Through A Temporal Lens

链接https://arxiv.org/abs/2602.00683

作者:Thong Thanh Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advance video understanding, State Space Layers, large vision-language models, leverage temporal relations, upscaled video understanding

备注: PhD Thesis, NUS, 2025

点击查看摘要

Abstract:This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model's ability to represent and reason about the fluid nature of video content.

217. 【2602.00671】HPC: Hierarchical Point-based Latent Representation for Streaming Dynamic Gaussian Splatting Compression

链接https://arxiv.org/abs/2602.00671

作者:Yangzhi Ma,Bojun Liu,Wenting Liao,Dong Liu,Zhu Li,Li Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dynamic Gaussian Splatting, Gaussian Splatting, streaming dynamic Gaussian, small memory footprint, Gaussian Splatting compression

备注

点击查看摘要

Abstract:While dynamic Gaussian Splatting has driven significant advances in free-viewpoint video, maintaining its rendering quality with a small memory footprint for efficient streaming transmission still presents an ongoing challenge. Existing streaming dynamic Gaussian Splatting compression methods typically leverage a latent representation to drive the neural network for predicting Gaussian residuals between frames. Their core latent representations can be categorized into structured grid-based and unstructured point-based paradigms. However, the former incurs significant parameter redundancy by inevitably modeling unoccupied space, while the latter suffers from limited compactness as it fails to exploit local correlations. To relieve these limitations, we propose HPC, a novel streaming dynamic Gaussian Splatting compression framework. It employs a hierarchical point-based latent representation that operates on a per-Gaussian basis to avoid parameter redundancy in unoccupied space. Guided by a tailored aggregation scheme, these latent points achieve high compactness with low spatial redundancy. To improve compression efficiency, we further undertake the first investigation to compress neural networks for streaming dynamic Gaussian Splatting through mining and exploiting the inter-frame correlation of parameters. Combined with latent compression, this forms a fully end-to-end compression framework. Comprehensive experimental evaluations demonstrate that HPC substantially outperforms state-of-the-art methods. It achieves a storage reduction of 67% against its baseline while maintaining high reconstruction fidelity.

218. 【2602.00669】Improving Neuropathological Reconstruction Fidelity via AI Slice Imputation

链接https://arxiv.org/abs/2602.00669

作者:Marina Crespo Aguirre,Jonathan Williams-Ramirez,Dina Zemlyanker,Xiaoling Hu,Lucas J. Deden-Binder,Rogeny Herisse,Mark Montine,Theresa R. Connors,Christopher Mount,Christine L. MacDonald,C. Dirk Keene,Caitlin S. Latimer,Derek H. Oakley,Bradley T. Hyman,Ana Lawry Aguila,Juan Eugenio Iglesias

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

关键词:Neuropathological analyses benefit, improve morphometric accuracy, spatially precise volumetric, Neuropathological analyses, precise volumetric reconstructions

备注: 12 pages of main content, 5 pages of supplement

点击查看摘要

Abstract:Neuropathological analyses benefit from spatially precise volumetric reconstructions that enhance anatomical delineation and improve morphometric accuracy. Our prior work has shown the feasibility of reconstructing 3D brain volumes from 2D dissection photographs. However these outputs sometimes exhibit coarse, overly smooth reconstructions of structures, especially under high anisotropy (i.e., reconstructions from thick slabs). Here, we introduce a computationally efficient super-resolution step that imputes slices to generate anatomically consistent isotropic volumes from anisotropic 3D reconstructions of dissection photographs. By training on domain-randomized synthetic data, we ensure that our method generalizes across dissection protocols and remains robust to large slab thicknesses. The imputed volumes yield improved automated segmentations, achieving higher Dice scores, particularly in cortical and white matter regions. Validation on surface reconstruction and atlas registration tasks demonstrates more accurate cortical surfaces and MRI registration. By enhancing the resolution and anatomical fidelity of photograph-based reconstructions, our approach strengthens the bridge between neuropathology and neuroimaging. Our method is publicly available at this https URL

219. 【2602.00661】Schrödinger-Inspired Time-Evolution for 4D Deformation Forecasting

链接https://arxiv.org/abs/2602.00661

作者:Ahsan Raza Siyal,Markus Haltmeier,Ruth Steiger,Elke Ruth Gizewski,Astrid Ellen Grams

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complex three-dimensional phenomena, three-dimensional phenomena, fluid and material, complex three-dimensional, medical imaging applications

备注

点击查看摘要

Abstract:Spatiotemporal forecasting of complex three-dimensional phenomena (4D: 3D + time) is fundamental to applications in medical imaging, fluid and material dynamics, and geophysics. In contrast to unconstrained neural forecasting models, we propose a Schrödinger-inspired, physics-guided neural architecture that embeds an explicit time-evolution operator within a deep convolutional framework for 4D prediction. From observed volumetric sequences, the model learns voxelwise amplitude, phase, and potential fields that define a complex-valued wavefunction $\psi = A e^{i\phi}$, which is evolved forward in time using a differentiable, unrolled Schrödinger time stepper. This physics-guided formulation yields several key advantages: (i) temporal stability arising from the structured evolution operator, which mitigates drift and error accumulation in long-horizon forecasting; (ii) an interpretable latent representation, where phase encodes transport dynamics, amplitude captures structural intensity, and the learned potential governs spatiotemporal interactions; and (iii) natural compatibility with deformation-based synthesis, which is critical for preserving anatomical fidelity in medical imaging applications. By integrating physical priors directly into the learning process, the proposed approach combines the expressivity of deep networks with the robustness and interpretability of physics-based modeling. We demonstrate accurate and stable prediction of future 4D states, including volumetric intensities and deformation fields, on synthetic benchmarks that emulate realistic shape deformations and topological changes. To our knowledge, this is the first end-to-end 4D neural forecasting framework to incorporate a Schrödinger-type evolution operator, offering a principled pathway toward interpretable, stable, and anatomically consistent spatiotemporal prediction.

220. 【2602.00653】Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment

链接https://arxiv.org/abs/2602.00653

作者:Lukas Kuhn,Giuseppe Serra,Florian Buettner

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:CLIP require large, large batch sizes, require large batch, approaches like CLIP, CLIP require

备注

点击查看摘要

Abstract:Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zeroshot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.

221. 【2602.00650】A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation

链接https://arxiv.org/abs/2602.00650

作者:Mohammadreza Gholipour Shahraki,Mehdi Rezaeian,Mohammad Ghasemzadeh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:frozen SAM encoder, Mamba-based State Space, frozen SAM, treatment planning, essential for clinical

备注

点击查看摘要

Abstract:Accurate segmentation of 3D medical images such as MRI and CT is essential for clinical diagnosis and treatment planning. Foundation models like the Segment Anything Model (SAM) provide powerful general-purpose representations but struggle in medical imaging due to domain shift, their inherently 2D design, and the high computational cost of fine-tuning. To address these challenges, we propose Mamba-SAM, a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs). We investigate two parameter-efficient adaptation strategies. The first is a dual-branch architecture that explicitly fuses general features from a frozen SAM encoder with domain-specific representations learned by a trainable VMamba encoder using cross-attention. The second is an adapter-based approach that injects lightweight, 3D-aware Tri-Plane Mamba (TPMamba) modules into the frozen SAM ViT encoder to implicitly model volumetric context. Within this framework, we introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating. Extensive experiments on the ACDC cardiac MRI dataset demonstrate the effectiveness of the proposed methods. The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventricle (0.971) segmentation. The adapter-based TP MFGC variant offers superior inference speed (4.77 FPS) with strong accuracy (0.880 Dice). These results show that hybridizing foundation models with efficient SSM-based architectures provides a practical and effective solution for 3D medical image segmentation.

222. 【2602.00639】Diff-PC: Identity-preserving and 3D-aware Controllable Diffusion for Zero-shot Portrait Customization

链接https://arxiv.org/abs/2602.00639

作者:Yifang Xu,Benxiang Zhai,Chenyu Zhang,Ming Li,Yang Li,Sidan Du

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently garnered significant, garnered significant attention, significant attention due, potential applications, recently garnered

备注: Accepted by Information Fusion 2025

点击查看摘要

Abstract:Portrait customization (PC) has recently garnered significant attention due to its potential applications. However, existing PC methods lack precise identity (ID) preservation and face control. To address these tissues, we propose Diff-PC, a diffusion-based framework for zero-shot PC, which generates realistic portraits with high ID fidelity, specified facial attributes, and diverse backgrounds. Specifically, our approach employs the 3D face predictor to reconstruct the 3D-aware facial priors encompassing the reference ID, target expressions, and poses. To capture fine-grained face details, we design ID-Encoder that fuses local and global facial features. Subsequently, we devise ID-Ctrl using the 3D face to guide the alignment of ID features. We further introduce ID-Injector to enhance ID fidelity and facial controllability. Finally, training on our collected ID-centric dataset improves face similarity and text-to-image (T2I) alignment. Extensive experiments demonstrate that Diff-PC surpasses state-of-the-art methods in ID preservation, facial control, and T2I consistency. Furthermore, our method is compatible with multi-style foundation models.

223. 【2602.00637】VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning

链接https://arxiv.org/abs/2602.00637

作者:Vivek Madhavaram,Vartika Sengar,Arkadipta De,Charu Sharma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision, scene graph generation, scene graph, Scene, fundamental problem

备注: WACV 2026, Project page: [this https URL](https://vivekmadhavaram.github.io/vizor/)

点击查看摘要

Abstract:Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like "left/right", which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object's front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.

224. 【2602.00635】S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning

链接https://arxiv.org/abs/2602.00635

作者:Lingsong Wang,Mancheng Meng,Ziyan Wu,Terrence Chen,Fan Yang,Dinggang Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Existing face parsing, face parsing methods, Existing face, parsing methods, methods usually misclassify

备注

点击查看摘要

Abstract:Existing face parsing methods usually misclassify occlusions as facial components. This is because occlusion is a high-level concept, it does not refer to a concrete category of object. Thus, constructing a real-world face dataset covering all categories of occlusion object is almost impossible and accurate mask annotation is labor-intensive. To deal with the problems, we present S$^3$POT, a contrast-driven framework synergizing face generation with self-supervised spatial prompting, to achieve occlusion segmentation. The framework is inspired by the insights: 1) Modern face generators' ability to realistically reconstruct occluded regions, creating an image that preserve facial geometry while eliminating occlusion, and 2) Foundation segmentation models' (e.g., SAM) capacity to extract precise mask when provided with appropriate prompts. In particular, S$^3$POT consists of three modules: Reference Generation (RF), Feature enhancement (FE), and Prompt Selection (PS). First, a reference image is produced by RF using structural guidance from parsed mask. Second, FE performs contrast of tokens between raw and reference images to obtain an initial prompt, then modifies image features with the prompt by cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is learned under the guidance of three novel and complementary objective functions without occlusion ground truth mask involved. Extensive experiments on a dedicatedly collected dataset demonstrate S$^3$POT's superior performance and the effectiveness of each module.

225. 【2602.00627】FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization

链接https://arxiv.org/abs/2602.00627

作者:Benxiang Zhai,Yifang Xu,Guofeng Zhang,Yang Li,Sidan Du

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:great strides recently, made great strides, strides recently, significant advancements, made great

备注: Accept by ICANN 2025

点击查看摘要

Abstract:Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces extremely consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that can extract comprehensive fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. Then we use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.

226. 【2602.00621】owards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering

链接https://arxiv.org/abs/2602.00621

作者:Guangtao Lyu,Xinyi Cheng,Qi Liu,Chenghao Xu,Jiexi Yan,Muli Yang,Fen Fang,Cheng Deng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:LVLMs achieve remarkable, achieve remarkable multimodal, achieve remarkable, neurons, image-specific neurons

备注

点击查看摘要

Abstract:LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.

227. 【2602.00618】une-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting

链接https://arxiv.org/abs/2602.00618

作者:Yian Zhao,Rushi Ye,Ruochong Zheng,Zesen Cheng,Chaoran Feng,Jiashu Yang,Pengchong Qiao,Chang Liu,Jie Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reference style images, style transfer, style transfer refers, style, assets based

备注: ICCV 2025

点击查看摘要

Abstract:3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed \textbf{Tune-Your-Style}, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized views and zero-style guidance from the initial rendering. Extensive experiments demonstrate that our method not only delivers visually appealing results, but also exhibits flexible customizability for 3D style transfer. Project page is available at this https URL.

228. 【2602.00593】From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking

链接https://arxiv.org/abs/2602.00593

作者:Yifan Jiang,Cong Zhang,Bofei Zhang,Yifan Yang,Bingzhang Wang,Yew-Soon Ong

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:deliberate knowledge-based reasoning, general tasks, skills separately, progress on general, synergy not captured

备注

点击查看摘要

Abstract:Despite progress on general tasks, VLMs struggle with challenges demanding both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks that evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios and situations, with questions and answers meticulously crafted by annotators holding PhDs from top global universities working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models like Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.

229. 【2602.00583】MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation

链接https://arxiv.org/abs/2602.00583

作者:Xiangdong Li,Ye Lou,Ao Gao,Wei Zhang,Siyang Song

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:precise Action Unit, Action Unit, demographically diverse face, recognition systems, occurrence and intensity

备注

点击查看摘要

Abstract:The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.

230. 【2602.00579】Bridging Degradation Discrimination and Generation for Universal Image Restoration

链接https://arxiv.org/abs/2602.00579

作者:JiaKui Hu,Zhengjian Yao,Lujia Jin,Yanye Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Universal image restoration, produce clean images, Universal image, low-level vision, produce clean

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model's capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are provided in this https URL.

231. 【2602.00573】When Classes Evolve: A Benchmark and Framework for Stage-Aware Class-Incremental Learning

链接https://arxiv.org/abs/2602.00573

作者:Zheng Zhang,Tao Hu,Xueheng Li,Yang Wang,Rui Li,Jie Zhang,Chengjun Xie

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Class-Incremental Learning, previously learned knowledge, mitigating catastrophic forgetting, aims to sequentially, mitigating catastrophic

备注

点击查看摘要

Abstract:Class-Incremental Learning (CIL) aims to sequentially learn new classes while mitigating catastrophic forgetting of previously learned knowledge. Conventional CIL approaches implicitly assume that classes are morphologically static, focusing primarily on preserving previously learned representations as new classes are introduced. However, this assumption neglects intra-class evolution: a phenomenon wherein instances of the same semantic class undergo significant morphological transformations, such as a larva turning into a butterfly. Consequently, a model must both discriminate between classes and adapt to evolving appearances within a single class. To systematically address this challenge, we formalize Stage-Aware CIL (Stage-CIL), a paradigm in which each class is learned progressively through distinct morphological stages. To facilitate rigorous evaluation within this paradigm, we introduce the Stage-Bench, a 10-domain, 2-stages dataset and protocol that jointly measure inter- and intra-class forgetting. We further propose STAGE, a novel method that explicitly learns abstract and transferable evolution patterns within a fixed-size memory pool. By decoupling semantic identity from transformation dynamics, STAGE enables accurate prediction of future morphologies based on earlier representations. Extensive empirical evaluation demonstrates that STAGE consistently and substantially outperforms existing state-of-the-art approaches, highlighting its effectiveness in simultaneously addressing inter-class discrimination and intra-class morphological adaptation.

232. 【2602.00570】GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates

链接https://arxiv.org/abs/2602.00570

作者:Xingyu Luo,Yidong Cai,Jie Liu,Jie Tang,Gangshan Wu,Limin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gained increasing attention, gained increasing, increasing attention, Vision-language tracking, Vision-language

备注

点击查看摘要

Abstract:Vision-language tracking has gained increasing attention in many scenarios. This task simultaneously deals with visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains in its early stage. Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features. However, persistent challenges about low-semantic images including prevalent image blurriness, low resolution and so on, may compromise model performance through degraded cross-modal understanding. To solve this problem, language assistance is usually used to deal with the obstacles posed by low-semantic images. However, due to the existing gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image to bolster compatibility between language and image and enhance template image semantic information. Our approach demonstrates notable improvements over the existing fusion paradigms. Blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state-of-the-art on multiple benchmarks and achieves an impressive inference speed. The code and models will be released at: this https URL

233. 【2602.00559】Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models

链接https://arxiv.org/abs/2602.00559

作者:Wenbin Xing,Quanxing Zha,Lizheng Zu,Mengran Li,Ming Li,Junchi Yan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:factors largely underexplored, mitigation primarily focuses, multiple interacting spatial, temporal factors largely, Current research

备注

点击查看摘要

Abstract:Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, arising from incorrect reasoning over multiple interacting spatial and temporal factors largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., "All are correct" and "None of the above") to prevent shortcut reasoning. The evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidences. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be find at this https URL.

234. 【2602.00551】APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

链接https://arxiv.org/abs/2602.00551

作者:Daoxuan Zhang,Ping Chen,Xiaobo Xia,Xiu Su,Ruichen Zhen,Jianqiang Xiao,Shuo Yang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Object Goal Navigation, Unmanned Aerial Vehicle, Aerial Object Goal, Goal Navigation, Object Goal

备注: 15 pages, 8 figures

点击查看摘要

Abstract:Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we introduce \textbf{APEX} (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy. 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2\% SR and +2.8\% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is provided in \href{this https URL}{GitHub}

235. 【2602.00542】NPNet: A Non-Parametric Network with Adaptive Gaussian-Fourier Positional Encoding for 3D Classification and Segmentation

链接https://arxiv.org/abs/2602.00542

作者:Mohammad Saeid,Amir Salarpour,Pedram MohajerAnsari,Mert D. Pesé

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:fully non-parametric approach, point-cloud classification, classification and part, part segmentation, present NPNet

备注: Accepted to the 2026 IEEE Intelligent Vehicles Symposium (IV 2026)

点击查看摘要

Abstract:We present NPNet, a fully non-parametric approach for 3D point-cloud classification and part segmentation. NPNet contains no learned weights; instead, it builds point features using deterministic operators such as farthest point sampling, k-nearest neighbors, and pooling. Our key idea is an adaptive Gaussian-Fourier positional encoding whose bandwidth and Gaussian-cosine mixing are chosen from the input geometry, helping the method remain stable across different scales and sampling densities. For segmentation, we additionally incorporate fixed-frequency Fourier features to provide global context alongside the adaptive encoding. Across ModelNet40/ModelNet-R, ScanObjectNN, and ShapeNetPart, NPNet achieves strong performance among non-parametric baselines, and it is particularly effective in few-shot settings on ModelNet40. NPNet also offers favorable memory use and inference time compared to prior non-parametric methods

236. 【2602.00536】SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal

链接https://arxiv.org/abs/2602.00536

作者:Yifan Zhang,Qian Chen,Yi Liu,Wengen Li,Jihong Guan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Earth observation tasks, downstream Earth observation, contamination severely degrades, downstream Earth, Earth observation

备注

点击查看摘要

Abstract:Cloud contamination severely degrades the usability of remote sensing imagery and poses a fundamental challenge for downstream Earth observation tasks. Recently, diffusion-based models have emerged as a dominant paradigm for remote sensing cloud removal due to their strong generative capability and stable optimization. However, existing diffusion-based approaches often suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal remote sensing scenarios. In this work, we propose SADER, a structure-aware diffusion framework for multi-temporal remote sensing cloud removal. SADER first develops a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) to fully capture multi-temporal and multimodal correlations via temporal fusion and hybrid attention. Then, a cloud-aware attention loss is introduced to emphasize cloud-dominated regions by accounting for cloud thickness and brightness discrepancies. In addition, a deterministic resampling strategy is designed for continuous diffusion models to iteratively refine samples under fixed sampling steps by replacing outliers through guided correction. Extensive experiments on multiple multi-temporal datasets demonstrate that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics. The code of SADER is publicly available at this https URL.

237. 【2602.00531】Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment

链接https://arxiv.org/abs/2602.00531

作者:Tianyi Zhang,Antoine Simoulin,Kai Li,Sana Lakdawala,Shiqing Yu,Arpit Mittal,Hongyu Fu,Yu Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Traditional object detection, Traditional object, limiting their applicability, dynamic environments, systems are typically

备注

点击查看摘要

Abstract:Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes not present in the training set. Recent advances in visual-language modeling have led to significant progress of OVD. However, prior works face challenges in either adapting the single-scale image backbone from CLIP to the detection framework or ensuring robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge from CLIP and adapts the backbone for object detection through feature pyramid. In addition, we introduce the SigRPN block, which incorporates a sigmoid-based anchor-text contrastive alignment loss to improve detection of novel categories. Through extensive experiments, our approach achieves 58.7 AP for novel classes on COCO2017 and 24.8 AP on LVIS, surpassing all state-of-the-art methods and achieving significant improvements of 27.6% and 6.9%, respectively. Furthermore, VLDet also demonstrates superior zero-shot performance on closed-set object detection.

238. 【2602.00523】SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding

链接https://arxiv.org/abs/2602.00523

作者:Yujia Tong,Tian Zhang,Yunyang Wan,Kaiwei Lin,Jingling Yuan,Chuang Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling parallel verification, multiple draft tokens, Speculative decoding, vision-language models, draft tokens

备注

点击查看摘要

Abstract:Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves faster acceleration compared to static tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to $3.36\times$ decoding speedup for LLaVA-OneVision-72B and $3.18\times$ for Qwen2.5-VL-72B.

239. 【2602.00522】MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval

链接https://arxiv.org/abs/2602.00522

作者:Chaoran Xu,Chengkan Lv,Qiyu Chen,Feng Zhang,Zhengtao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Zero-shot anomaly detection, limited cross-domain stability, leverages pretrained vision, Anomaly Detection method, Zero-shot anomaly

备注

点击查看摘要

Abstract:Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose Memory-Retrieval Anomaly Detection method (MRAD), a unified framework that replaces parametric fitting with a direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Based on the MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomaly; (ii) MRAD-CLIP injects the normal and anomalous region priors from the MRAD-FT as dynamic biases into CLIP's learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. The code will be publicly released at this https URL.

240. 【2602.00516】SPARK: Stochastic Propagation via Affinity-guided Random walK for training-free unsupervised segmentation

链接https://arxiv.org/abs/2602.00516

作者:Kunal Mahatha,Jose Dolz,Christian Desrosiers

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:limiting assumption, diffusion-derived affinities, spectral graph partitioning, argue that existing, implicit and limiting

备注

点击查看摘要

Abstract:We argue that existing training-free segmentation methods rely on an implicit and limiting assumption, that segmentation is a spectral graph partitioning problem over diffusion-derived affinities. Such approaches, based on global graph partitioning and eigenvector-based formulations of affinity matrices, suffer from several fundamental drawbacks, they require pre-selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi-modal affinity distributions. Moreover, many prior works neglect the importance of local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine-grained contours. To address these limitations, we reformulate training-free segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs, where segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from stable diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random-walk-based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.

241. 【2602.00508】DuoGen: Towards General Purpose Interleaved Multimodal Generation

链接https://arxiv.org/abs/2602.00508

作者:Min Shi,Xiaohui Zeng,Jiannan Huang,Yin Cui,Francesco Ferroni,Jialuo Li,Shubham Pachori,Zhaoshuo Li,Yogesh Balaji,Haoxiang Wang,Tsung-Yi Lin,Xiao Fu,Yue Zhao,Chieh-Yun Chen,Ming-Yu Liu,Humphrey Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generating visual drafts, instructional guides, drafts for reasoning, generation enables capabilities, multimodal generation enables

备注: Technical Report. Project Page: [this https URL](https://research.nvidia.com/labs/dir/duetgen/)

点击查看摘要

Abstract:Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at this https URL.

242. 【2602.00505】Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models

链接https://arxiv.org/abs/2602.00505

作者:Jingrui Zhang,Feng Liang,Yong Zhang,Wei Wang,Runhao Zeng,Xiping Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:large language models, large language, natural language understanding, multimodal large language, language models

备注

点击查看摘要

Abstract:With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model's ability of cross-modality understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable the efficient and hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which performs the fusion of visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding an increase in computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks with generality and scalability for different base LLMs.

243. 【2602.00504】RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

链接https://arxiv.org/abs/2602.00504

作者:Jiahe Wu,Bing Cao,Qilong Wang,Qinghua Hu,Dongdong Li,Pengfei Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLM) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLM's perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs' RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.

244. 【2602.00490】HSSDCT: Factorized Spatial-Spectral Correlation for Hyperspectral Image Fusion

链接https://arxiv.org/abs/2602.00490

作者:Chia-Ming Lee,Yu-Hao Ho,Yu-Fan Lin,Jen-Wei Lee,Li-Wei Kang,Chih-Chung Hsu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-resolution multispectral image, Hyperspectral image, rich spectral information, fine spatial details, multispectral image

备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:Hyperspectral image (HSI) fusion aims to reconstruct a high-resolution HSI (HR-HSI) by combining the rich spectral information of a low-resolution HSI (LR-HSI) with the fine spatial details of a high-resolution multispectral image (HR-MSI). Although recent deep learning methods have achieved notable progress, they still suffer from limited receptive fields, redundant spectral bands, and the quadratic complexity of self-attention, which restrict both efficiency and robustness. To overcome these challenges, we propose the Hierarchical Spatial-Spectral Dense Correlation Network (HSSDCT). The framework introduces two key modules: (i) a Hierarchical Dense-Residue Transformer Block (HDRTB) that progressively enlarges windows and employs dense-residue connections for multi-scale feature aggregation, and (ii) a Spatial-Spectral Correlation Layer (SSCL) that explicitly factorizes spatial and spectral dependencies, reducing self-attention to linear complexity while mitigating spectral redundancy. Extensive experiments on benchmark datasets demonstrate that HSSDCT delivers superior reconstruction quality with significantly lower computational costs, achieving new state-of-the-art performance in HSI fusion. Our code is available at this https URL.

245. 【2602.00489】Refining Strokes by Learning Offset Attributes between Strokes for Flexible Sketch Edit at Stroke-Level

链接https://arxiv.org/abs/2602.00489

作者:Sicong Zang,Tao Sun,Cairong Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:source stroke, preserving semantic consistency, transplant source strokes, source, Sketch

备注: Source codes are coming soon

点击查看摘要

Abstract:Sketch edit at stroke-level aims to transplant source strokes onto a target sketch via stroke expansion or replacement, while preserving semantic consistency and visual fidelity with the target sketch. Recent studies addressed it by relocating source strokes at appropriate canvas positions. However, as source strokes could exhibit significant variations in both size and orientation, we may fail to produce plausible sketch editing results by merely repositioning them without further adjustments. For example, anchoring an oversized source stroke onto the target without proper scaling would fail to produce a semantically coherent outcome. In this paper, we propose SketchMod to refine the source stroke through transformation so as to align it with the target sketch's patterns, further realize flexible sketch edit at stroke-level. As the source stroke refinement is governed by the patterns of the target sketch, we learn three key offset attributes (scale, orientation and position) from the source stroke to another, and align it with the target by: 1) resizing to match spatial proportions by scale, 2) rotating to align with local geometry by orientation, and 3) displacing to meet with semantic layout by position. Besides, a stroke's profiles can be precisely controlled during sketch edit via the exposed captured stroke attributes. Experimental results indicate that SketchMod achieves precise and flexible performances on stroke-level sketch edit.

246. 【2602.00484】GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association

链接https://arxiv.org/abs/2602.00484

作者:Rong-Lin Jian,Ming-Chi Luo,Chen-Wei Huang,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:irregular player motion, highly challenging due, uniform appearances, player motion, sports is highly

备注: Winner Solution of SoccerTrack in ACM Multimedia 2025 Workshop MMSports

点击查看摘要

Abstract:Multi-object tracking (MOT) in sports is highly challenging due to irregular player motion, uniform appearances, and frequent occlusions. These difficulties are further exacerbated by the geometric distortion and extreme scale variation introduced by static fisheye cameras. In this work, we present GTATrack, a hierarchical tracking framework that win first place in the SoccerTrack Challenge 2025. GTATrack integrates two core components: Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association and Global Tracklet Association (GTA) for trajectory-level refinement. This two-stage design enables both robust short-term matching and long-term identity consistency. Additionally, a pseudo-labeling strategy is used to boost detector recall on small and distorted targets. The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation. Our method achieved a winning HOTA score of 0.60 and significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking. Our code is available at this https URL.

247. 【2602.00471】Dual Latent Memory for Visual Multi-agent System

链接https://arxiv.org/abs/2602.00471

作者:Xinlei Yu,Chengming Xu,Zhangquan Chen,Bo Yin,Cheng Yang,Yongbo He,Yihao Hu,Jiangning Zhang,Cheng Tan,Xiaobin Hu,Shuicheng Yan

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Multi-Agent Systems, empirical evidence reveals, increasing agent turns, enhance comprehensive abilities, inflating token costs

备注

点击查看摘要

Abstract:While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L$^{2}$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments among backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%. Codes: this https URL.

248. 【2602.00470】ZS-TreeSeg: A Zero-Shot Framework for Tree Crown Instance Segmentation

链接https://arxiv.org/abs/2602.00470

作者:Pengyu Chen,Fangzheng Lyu,Sicheng Wang,Cuizhen Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:forest biomass estimation, Individual tree crown, Individual tree, ecological monitoring, remote sensing

备注

点击查看摘要

Abstract:Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose ZS-TreeSeg, a Zero-Shot framework that adapts from two mature tasks: 1) Canopy Semantic segmentation; and 2) Cells instance segmentation. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the ZS-TreeSeg framework forces the mathematical separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets and visual inspection demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, which can offer a training-free solution for tree crown instance segmentation and labels generation.

249. 【2602.00463】PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting

链接https://arxiv.org/abs/2602.00463

作者:Xin Zhang,Shen Chen,Jiale Zhou,Lei Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generating realistic, Generating, Abstract, immersive applications, scenes

备注: Accepted to ICASSP2026

点击查看摘要

Abstract:Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.

250. 【2602.00462】LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

链接https://arxiv.org/abs/2602.00462

作者:Benno Krojer,Shravan Nayak,Oscar Mañas,Vaibhav Adlakha,Desmond Elliott,Siva Reddy,Marius Mosbach

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Transforming a large, visual tokens, visual, embedding space, visual token

备注

点击查看摘要

Abstract:Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.

251. 【2602.00450】Model Optimization for Multi-Camera 3D Detection and Tracking

链接https://arxiv.org/abs/2602.00450

作者:Ethan Anderson(1),Justin Silva(1),Kyle Zheng(1),Sameer Pusegaonkar(2),Yizhou Wang(2),Zheng Tang(2),Sujit Biswas(2) ((1) Clemson University, (2) NVIDIA)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Outside-in multi-camera perception, support multi-target tracking, Outside-in multi-camera, indoor environments, heterogeneous viewpoints

备注

点击查看摘要

Abstract:Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.

252. 【2602.00440】DISK: Dynamic Inference SKipping for World Models

链接https://arxiv.org/abs/2602.00440

作者:Anugunj Naman,Gaibo Zhang,Ayushman Singh,Yaguang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:training-free adaptive inference, adaptive inference method, autoregressive world models, world models, training-free adaptive

备注

点击查看摘要

Abstract:We present DISK, a training-free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego-trajectory via dual-branch controllers with cross-modal skip decisions, preserving motion-appearance consistency without retraining. We extend higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagate controller statistics through rollout loops for long-horizon stability. When integrated into closed-loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long-horizon video-and-trajectory prediction at substantially reduced cost.

253. 【2602.00420】xt is All You Need for Vision-Language Model Jailbreaking

链接https://arxiv.org/abs/2602.00420

作者:Yihang Chen,Zhao Xu,Youyuan Jiang,Tianle Zheng,Cho-Jui Hsieh

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词:Large Vision-Language Models, Large Vision-Language, Optical Character Recognition, model Optical Character, increasingly equipped

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple and semantically related but more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the position of the sub-queries being middle within the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, where the model's safety protocols fail to link the scattered sub-queries within a high number of irrelevant queries. Overall, our findings expose a critical vulnerability in LVLMs' OCR capabilities that are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses for fragmented multimodal inputs.

254. 【2602.00414】oward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure

链接https://arxiv.org/abs/2602.00414

作者:Trishna Chakraborty,Udita Ghosh,Aldair Ernesto Gongora,Ruben Glatt,Yue Dong,Jiachen Li,Amit K. Roy-Chowdhury,Chengyu Song

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:minor unsafe actions, Laboratories are prone, pre-lab safety training, continuous safety monitoring, mandatory pre-lab safety

备注

点击查看摘要

Abstract:Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring -- beyond mandatory pre-lab safety training -- is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given textual scene graph, but degrade substantially in visual-only settings indicating difficulty in extracting structured object relationships directly from pixels. To overcome this, we propose a post-training context-engineering approach, scene-graph-guided alignment, to bridge perceptual gaps in VLMs by translating visual inputs into structured scene graphs better aligned with VLM reasoning, improving hazard detection performance in visual only settings.

255. 【2602.00395】3DGS$^2$-TR: Scalable Second-Order Trust-Region Method for 3D Gaussian Splatting

链接https://arxiv.org/abs/2602.00395

作者:Roger Hsiao,Yuchen Fang,Xiangru Huang,Ruilong Li,Hesam Rabeti,Zan Gojcic,Javad Lavaei,James Demmel,Sophia Shao

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)

关键词:TR,a second-order optimizer, Gaussian Splatting, optimizer for accelerating, TR,a second-order, scene training problem

备注

点击查看摘要

Abstract:We propose 3DGS$^2$-TR,a second-order optimizer for accelerating the scene training problem in 3D Gaussian Splatting (3DGS). Unlike existing second-order approaches that rely on explicit or dense curvature representations, such as 3DGS-LM (Höllein et al., 2025) or 3DGS2 (Lan et al., 2025), our method approximates curvature using only the diagonal of the Hessian matrix, efficiently via Hutchinson's method. Our approach is fully matrix-free and has the same complexity as ADAM (Kingma, 2024), $O(n)$ in both computation and memory costs. To ensure stable optimization in the presence of strong nonlinearity in the 3DGS rasterization process, we introduce a parameter-wise trust-region technique based on the squared Hellinger distance, regularizing updates to Gaussian parameters. Under identical parameter initialization and without densification, 3DGS$^2$-TR is able to achieve better reconstruction quality on standard datasets, using 50% fewer training iterations compared to ADAM, while incurring less than 1GB of peak GPU memory overhead (17% more than ADAM and 85% less than 3DGS-LM), enabling scalability to very large scenes and potentially to distributed training settings.

256. 【2602.00394】Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences

链接https://arxiv.org/abs/2602.00394

作者:Manoj Reddy Bethi,Sai Rupa Jhade,Pravallika Yaganti,Monoshiz Mahbub Khan,Zhe Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:obtaining labeled data, visual art presents, art presents significant, presents significant challenges, significant challenges due

备注

点击查看摘要

Abstract:Modeling human aesthetic judgments in visual art presents significant challenges due to individual preference variability and the high cost of obtaining labeled data. To reduce cost of acquiring such labels, we propose to apply a comparative learning framework based on pairwise preference assessments rather than direct ratings. This approach leverages the Law of Comparative Judgment, which posits that relative choices exhibit less cognitive burden and greater cognitive consistency than direct scoring. We extract deep convolutional features from painting images using ResNet-50 and develop both a deep neural network regression model and a dual-branch pairwise comparison model. We explored four research questions: (RQ1) How does the proposed deep neural network regression model with CNN features compare to the baseline linear regression model using hand-crafted features? (RQ2) How does pairwise comparative learning compare to regression-based prediction when lacking access to direct rating values? (RQ3) Can we predict individual rater preferences through within-rater and cross-rater analysis? (RQ4) What is the annotation cost trade-off between direct ratings and comparative judgments in terms of human time and effort? Our results show that the deep regression model substantially outperforms the baseline, achieving up to $328\%$ improvement in $R^2$. The comparative model approaches regression performance despite having no access to direct rating values, validating the practical utility of pairwise comparisons. However, predicting individual preferences remains challenging, with both within-rater and cross-rater performance significantly lower than average rating prediction. Human subject experiments reveal that comparative judgments require $60\%$ less annotation time per item, demonstrating superior annotation efficiency for large-scale preference modeling.

257. 【2602.00393】Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset

链接https://arxiv.org/abs/2602.00393

作者:Gabriel Bromonschenkel,Alessandro L. Koerich,Thiago M. Paixão,Hilário Tomaz Alves de Oliveira

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:media content generation, social media content, Brazilian Portuguese, natural language descriptions, Image captioning

备注: Accepted to JBCS. 18 pages, 11 figures

点击查看摘要

Abstract:Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: this https URL.

258. 【2602.00391】Robust automatic brain vessel segmentation in 3D CTA scans using dynamic 4D-CTA data

链接https://arxiv.org/abs/2602.00391

作者:Alberto Mario Ceballos-Arroyo,Shrikanth M. Yadav,Chu-Hsuan Lin,Jisoo Kim,Geoffrey S. Young,Huaizu Jiang,Lei Qin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:head scans, dynamic CTA acquisitions, methodology for annotating, brain vasculature, dynamic CTA

备注: 16 pages, 8 figures

点击查看摘要

Abstract:In this study, we develop a novel methodology for annotating the brain vasculature using dynamic 4D-CTA head scans. By using multiple time points from dynamic CTA acquisitions, we subtract bone and soft tissue to enhance the visualization of arteries and veins, reducing the effort required to obtain manual annotations of brain vessels. We then train deep learning models on our ground truth annotations by using the same segmentation for multiple phases from the dynamic 4D-CTA collection, effectively enlarging our dataset by 4 to 5 times and inducing robustness to contrast phases. In total, our dataset comprises 110 training images from 25 patients and 165 test images from 14 patients. In comparison with two similarly-sized datasets for CTA-based brain vessel segmentation, a nnUNet model trained on our dataset can achieve significantly better segmentations across all vascular regions, with an average mDC of 0.846 for arteries and 0.957 for veins in the TopBrain dataset. Furthermore, metrics such as average directed Hausdorff distance (adHD) and topology sensitivity (tSens) reflected similar trends: using our dataset resulted in low error margins (aDHD of 0.304 mm for arteries and 0.078 for veins) and high sensitivity (tSens of 0.877 for arteries and 0.974 for veins), indicating excellent accuracy in capturing vessel morphology. Our code and model weights are available online: this https URL

259. 【2602.00385】Deep Learning-Based Object Detection for Autonomous Vehicles: A Comparative Study of One-Stage and Two-Stage Detectors on Basic Traffic Objects

链接https://arxiv.org/abs/2602.00385

作者:Bsher Karbouj,Adam Michael Altenbuchner,Joerg Krueger

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous vehicle systems, autonomous vehicle, Faster R-CNN, vehicle systems, crucial component

备注

点击查看摘要

Abstract:Object detection is a crucial component in autonomous vehicle systems. It enables the vehicle to perceive and understand its environment by identifying and locating various objects around it. By utilizing advanced imaging and deep learning techniques, autonomous vehicle systems can rapidly and accurately identify objects based on their features. Different deep learning methods vary in their ability to accurately detect and classify objects in autonomous vehicle systems. Selecting the appropriate method significantly impacts system performance, robustness, and efficiency in real-world driving scenarios. While several generic deep learning architectures like YOLO, SSD, and Faster R-CNN have been proposed, guidance on their suitability for specific autonomous driving applications is often limited. The choice of method affects detection accuracy, processing speed, environmental robustness, sensor integration, scalability, and edge case handling. This study provides a comprehensive experimental analysis comparing two prominent object detection models: YOLOv5 (a one-stage detector) and Faster R-CNN (a two-stage detector). Their performance is evaluated on a diverse dataset combining real and synthetic images, considering various metrics including mean Average Precision (mAP), recall, and inference speed. The findings reveal that YOLOv5 demonstrates superior performance in terms of mAP, recall, and training efficiency, particularly as dataset size and image resolution increase. However, Faster R-CNN shows advantages in detecting small, distant objects and performs well in challenging lighting conditions. The models' behavior is also analyzed under different confidence thresholds and in various real-world scenarios, providing insights into their applicability for autonomous driving systems.

260. 【2602.00381】Modeling Image-Caption Rating from Comparative Judgments

链接https://arxiv.org/abs/2602.00381

作者:Kezia Minni,Qiang Zhang,Monoshiz Mahbub Khan,Zhe Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:time-consuming and subjective, model, describing images, comparative, comparative learning

备注

点击查看摘要

Abstract:Rating the accuracy of captions in describing images is time-consuming and subjective for humans. In contrast, it is often easier for people to compare two captions and decide which one better matches a given image. In this work, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Using the VICR dataset, we extract visual features with ResNet-50 and text features with MiniLM, then train both a regression model and a comparative learning model. While the regression model achieves better performance (Pearson's $\rho$: 0.7609 and Spearman's $r_s$: 0.7089), the comparative learning model steadily improves with more data and approaches the regression baseline. In addition, a small-scale human evaluation study comparing absolute rating, pairwise comparison, and same-image comparison shows that comparative annotation yields faster results and has greater agreement among human annotators. These results suggest that comparative learning can effectively model human preferences while significantly reducing the cost of human annotations.

261. 【2602.00350】ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

链接https://arxiv.org/abs/2602.00350

作者:Ignacy Kolton,Kacper Marzol,Paweł Batorski,Marcin Mazur,Paul Swoboda,Przemysław Spurek

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:key defense mechanism, recent evidence shows, removing unauthorized concepts, Machine unlearning, key defense

备注

点击查看摘要

Abstract:Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at this https URL

262. 【2602.00348】MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI

链接https://arxiv.org/abs/2602.00348

作者:Zhengyi Lu,Ming Lu,Chongyu Qu,Junchao Zhu,Junlin Guo,Marilyn Lionts,Yanfan Zhu,Yuechen Yang,Tianyuan Yao,Jayasai Rajagopal,Bennett Allan Landman,Xiao Wang,Xinqiang Yan,Yuankai Huo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:hinder clinical diagnosis, MRI, degrade image quality, Metal implants, degrade image

备注

点击查看摘要

Abstract:Metal implants in MRI cause severe artifacts that degrade image quality and hinder clinical diagnosis. Traditional approaches address metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems. We propose MASC, a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction for accelerated MRI. To enable supervised training, we construct a paired MRI dataset using physics-based simulation, generating k-space data and reconstructions for phantoms with and without metal implants. This paired dataset provides simulated 3D MRI scans with and without metal implants, where each metal-corrupted sample has an exactly matched clean reference, enabling direct supervision for both artifact reduction and acquisition policy learning. We formulate active MRI acquisition as a sequential decision-making problem, where an artifact-aware Proximal Policy Optimization (PPO) agent learns to select k-space phase-encoding lines under a limited acquisition budget. The agent operates on undersampled reconstructions processed through a U-Net-based MAR network, learning patterns that maximize reconstruction quality. We further propose an end-to-end training scheme where the acquisition policy learns to select k-space lines that best support artifact removal while the MAR network simultaneously adapts to the resulting undersampling patterns. Experiments demonstrate that MASC's learned policies outperform conventional sampling strategies, and end-to-end training improves performance compared to using a frozen pre-trained MAR network, validating the benefit of joint optimization. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data. The code and models of MASC have been made publicly available: this https URL

263. 【2602.00347】AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning

链接https://arxiv.org/abs/2602.00347

作者:Chongyu Qu,Zhengyi Lu,Yuxiang Lai,Thomas Z. Li,Junchao Zhu,Junlin Guo,Juming Xiong,Yanfan Zhu,Yuechen Yang,Allen J. Luna,Kim L. Sandler,Bennett A. Landman,Yuankai Huo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:heterogeneous data sources, integrating complementary information, clinical records, diagnosis and prognosis, integrating complementary

备注

点击查看摘要

Abstract:Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: for a given patient, should certain modalities be used at all? We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, where the policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762) compared to the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods. Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction.

264. 【2602.00344】When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

链接https://arxiv.org/abs/2602.00344

作者:Beidi Zhao,Wenlong Deng,Xinting Liao,Yushu Li,Nazim Shaikh,Yao Nie,Xiaoxiao Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:knowledge-based VQA tasks, enhancing Large Vision-Language, Large Vision-Language Models, recent work attributes, VQA tasks

备注: 18 pages, 10 figures

点击查看摘要

Abstract:While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous study overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

265. 【2602.00340】Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception

链接https://arxiv.org/abs/2602.00340

作者:Alexandros Christoforos,Sarah Jenkins,Michael Brown,Tuan Pham,David Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Neural Agents Network, Synergistic Neural Agents, pioneering Synergistic Neural, cross-modal alignment degeneration, Agents Network

备注

点击查看摘要

Abstract:This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.

266. 【2602.00314】On the Assessment of Sensitivity of Autonomous Vehicle Perception

链接https://arxiv.org/abs/2602.00314

作者:Apostol Vassilev,Munawar Hasan,Edward Griffor,Honglan Jin,Pavel Piliptchak,Mahima Arora,Thoshitha Gamage

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:provide real-time accurate, perception, decision-making and maneuvers, heavily dependent, information for robust

备注: 21 pages, 17 figures

点击查看摘要

Abstract:The viability of automated driving is heavily dependent on the performance of perception systems to provide real-time accurate and reliable information for robust decision-making and maneuvers. These systems must perform reliably not only under ideal conditions, but also when challenged by natural and adversarial driving factors. Both of these types of interference can lead to perception errors and delays in detection and classification. Hence, it is essential to assess the robustness of the perception systems of automated vehicles (AVs) and explore strategies for making perception more reliable. We approach this problem by evaluating perception performance using predictive sensitivity quantification based on an ensemble of models, capturing model disagreement and inference variability across multiple models, under adverse driving scenarios in both simulated environments and real-world conditions. A notional architecture for assessing perception performance is proposed. A perception assessment criterion is developed based on an AV's stopping distance at a stop sign on varying road surfaces, such as dry and wet asphalt, and vehicle speed. Five state-of-the-art computer vision models are used, including YOLO (v8-v9), DEtection TRansformer (DETR50, DETR101), Real-Time DEtection TRansformer (RT-DETR)in our experiments. Diminished lighting conditions, e.g., resulting from the presence of fog and low sun altitude, have the greatest impact on the performance of the perception models. Additionally, adversarial road conditions such as occlusions of roadway objects increase perception sensitivity and model performance drops when faced with a combination of adversarial road conditions and inclement weather conditions. Also, it is demonstrated that the greater the distance to a roadway object, the greater the impact on perception performance, hence diminished perception robustness.

267. 【2602.00309】Opportunistic Promptable Segmentation: Leveraging Routine Radiological Annotations to Guide 3D CT Lesion Segmentation

链接https://arxiv.org/abs/2602.00309

作者:Samuel Church,Joshua D. Warner,Danyal Maqbool,Xin Tie,Junjie Hu,Meghan G. Lubner,Tyler J. Bradshaw

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:machine learning models, diverse annotated datasets, promptable segmentation, development of machine, machine learning

备注

点击查看摘要

Abstract:The development of machine learning models for CT imaging depends on the availability of large, high-quality, and diverse annotated datasets. Although large volumes of CT images and reports are readily available in clinical picture archiving and communication systems (PACS), 3D segmentations of critical findings are costly to obtain, typically requiring extensive manual annotation by radiologists. On the other hand, it is common for radiologists to provide limited annotations of findings during routine reads, such as line measurements and arrows, that are often stored in PACS as GSPS objects. We posit that these sparse annotations can be extracted along with CT volumes and converted into 3D segmentations using promptable segmentation models, a paradigm we term Opportunistic Promptable Segmentation. To enable this paradigm, we propose SAM2CT, the first promptable segmentation model designed to convert radiologist annotations into 3D segmentations in CT volumes. SAM2CT builds upon SAM2 by extending the prompt encoder to support arrow and line inputs and by introducing Memory-Conditioned Memories (MCM), a memory encoding strategy tailored to 3D medical volumes. On public lesion segmentation benchmarks, SAM2CT outperforms existing promptable segmentation models and similarly trained baselines, achieving Dice similarity coefficients of 0.649 for arrow prompts and 0.757 for line prompts. Applying the model to pre-existing GSPS annotations from a clinical PACS (N = 60), SAM2CT generates 3D segmentations that are clinically acceptable or require only minor adjustments in 87% of cases, as scored by radiologists. Additionally, SAM2CT demonstrates strong zero-shot performance on select Emergency Department findings. These results suggest that large-scale mining of historical GSPS annotations represents a promising and scalable approach for generating 3D CT segmentation datasets.

268. 【2602.00292】LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification

链接https://arxiv.org/abs/2602.00292

作者:Rory Driscoll,Alexandros Christoforos,Chadbourne Davis

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:remains insufficiently explored, evidence remains insufficiently, actual visual evidence, visual evidence remains, complex multimodal tasks

备注

点击查看摘要

Abstract:While sequential reasoning enhances the capability of Vision-Language Models (VLMs) to execute complex multimodal tasks, their reliability in grounding these reasoning chains within actual visual evidence remains insufficiently explored. We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether VLMs can validate sequential causal chains against visual inputs, specifically targeting the pervasive issue of hallucination. Curated from 40,000 video segments from ShareGPT4Video and a subset of Flickr30k imagery, LogicGaze integrates causal sequences with visually contradictory yet linguistically plausible perturbations, compelling models to verify the authenticity of each reasoning step. Our tripartite evaluation protocol - Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection - exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.

269. 【2602.00289】Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory

链接https://arxiv.org/abs/2602.00289

作者:Alan Yuille,Daniel Kersten

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Bayes Decision Theory, Decision Theory, Bayes Decision, Cognitive Science, perspective of Bayes

备注

点击查看摘要

Abstract:This document presents an introduction to computer vision, and its relationship to Cognitive Science, from the perspective of Bayes Decision Theory (Berger 1985). Computer vision is a vast and complex field, so this overview has a narrow scope and provides a theoretical lens which captures many key concepts. BDT is rich enough to include two different approaches: (i) the Bayesian viewpoint, which gives a conceptually attractive framework for vision with concepts that resonate with Cognitive Science (Griffiths et al., 2024), and (ii) the Deep Neural Network approach whose successes in the real world have made Computer Vision into a trillion-dollar industry and which is motivated by the hierarchical structure of the visual ventral stream. The BDT framework relates and captures the strengths and weakness of these two approaches and, by discussing the limitations of BDT, points the way to how they can be combined in a richer framework.

270. 【2602.00288】meBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

链接https://arxiv.org/abs/2602.00288

作者:Baiqi Li,Kangyi Zhao,Ce Zhang,Chancharik Mitra,Jean de Dieu Nyandwi,Gedas Bertasius

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Multimodal Large, Large Language Models, Large Language, temporal

备注: For code and data, see [this https URL](https://baiqi-li.github.io/timeblind_project/)

点击查看摘要

Abstract:Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs), reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at this https URL .

271. 【2602.00268】okenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

链接https://arxiv.org/abs/2602.00268

作者:Ariel Shaulov,Eitan Shaar,Amit Edenzon,Lior Wolf

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:enables long video, long video synthesis, video generation enables, video synthesis, synthesis by iteratively

备注

点击查看摘要

Abstract:Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture, training procedure, or leaving latent space.

272. 【2602.00267】PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

链接https://arxiv.org/abs/2602.00267

作者:Gemma Canet Tarrés,Manel Baradad,Francesc Moreno-Noguer,Yumeng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:dramatically improved photorealistic, Recent advances, photorealistic image synthesis, improved photorealistic image, advances in generative

备注

点击查看摘要

Abstract:Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve objects consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with less omitted objects and visually appealing results.

273. 【2602.00265】World-Shaper: A Unified Framework for 360° Panoramic Editing

链接https://arxiv.org/abs/2602.00265

作者:Dong Liang,Yuhao Liu,Jinyuan Jia,Youjun Zhao,Rynson W.H.Lau

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:creating realistic, edit panoramic images, crucial for creating, visual experiences, existing perspective-based image

备注

点击查看摘要

Abstract:Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: this https URL

274. 【2602.00262】Subspace Clustering on Incomplete Data with Self-Supervised Contrastive Learning

链接https://arxiv.org/abs/2602.00262

作者:Huanran Li,Daniel Pimentel-Alarcón

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:finds wide application, hyperspectral imaging, group data points, computer vision, recommendation systems

备注

点击查看摘要

Abstract:Subspace clustering aims to group data points that lie in a union of low-dimensional subspaces and finds wide application in computer vision, hyperspectral imaging, and recommendation systems. However, most existing methods assume fully observed data, limiting their effectiveness in real-world scenarios with missing entries. In this paper, we propose a contrastive self-supervised framework, Contrastive Subspace Clustering (CSC), designed for clustering incomplete data. CSC generates masked views of partially observed inputs and trains a deep neural network using a SimCLR-style contrastive loss to learn invariant embeddings. These embeddings are then clustered using sparse subspace clustering. Experiments on six benchmark datasets show that CSC consistently outperforms both classical and deep learning baselines, demonstrating strong robustness to missing data and scalability to large datasets.

275. 【2602.00249】SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis

链接https://arxiv.org/abs/2602.00249

作者:Rishav Pramanik,Ian E. Nielsen,Jeff Smith,Saurav Pandit,Ravi P. Ramachandran,Zhaozheng Yin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:unprecedented creative potential, unlocked unprecedented creative, faithfully render complex, spatial relationships remains, render complex prompts

备注

点击查看摘要

Abstract:The rapid progress of text-to-image (T2I) models has unlocked unprecedented creative potential, yet their ability to faithfully render complex prompts involving multiple objects, attributes, and spatial relationships remains a significant bottleneck. Progress is hampered by a lack of adequate evaluation methods; current benchmarks are often restricted to closed-set vocabularies, lack fine-grained diagnostic capabilities, and fail to provide the interpretable feedback necessary to diagnose and remedy specific compositional failures. We solve these challenges by introducing SANEval (Spatial, Attribute, and Numeracy Evaluation), a comprehensive benchmark that establishes a scalable new pipeline for open-vocabulary compositional evaluation. SANEval combines a large language model (LLM) for deep prompt understanding with an LLM-enhanced, open-vocabulary object detector to robustly evaluate compositional adherence, unconstrained by a fixed vocabulary. Through extensive experiments on six state-of-the-art T2I models, we demonstrate that SANEval's automated evaluations provide a more faithful proxy for human assessment; our metric achieves a Spearman's rank correlation with statistically different results than those of existing benchmarks across tasks of attribute binding, spatial relations, and numeracy. To facilitate future research in compositional T2I generation and evaluation, we will release the SANEval dataset and our open-source evaluation pipeline.

276. 【2602.00247】CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models

链接https://arxiv.org/abs/2602.00247

作者:Samyak Jha,Junho Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Large Vision-Language Models, inference in Large, Large Vision-Language, cost of processing, processing thousands

备注

点击查看摘要

Abstract:Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on our findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency--performance trade-offs with improved robustness.

277. 【2602.00222】MapDream: Task-Driven Map Learning for Vision-Language Navigation

链接https://arxiv.org/abs/2602.00222

作者:Guoxin Lian,Shuo Wang,Yucheng Wang,Yongcai Wang,Maiyue Chen,Kaihui Wang,Bo Zhang,Zhizhong Su,Deying Li,Zhaoxin Fan

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:follow natural language, natural language instructions, aggregate spatial context, requires agents, partially observed

备注

点击查看摘要

Abstract:Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

278. 【2602.00216】Development of a Cacao Disease Identification and Management App Using Deep Learning

链接https://arxiv.org/abs/2602.00216

作者:Zaldy Pagaduan,Jason Occidental,Nathaniel Duro,Dexielito Badilles,Eleonor Palconit

类目:Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)

关键词:unlike larger plantations, outdated farming techniques, face significant challenges, unlike larger, resources and expertise

备注: 6 pages, 8 figures, preprint

点击查看摘要

Abstract:Smallholder cacao producers often rely on outdated farming techniques and face significant challenges from pests and diseases, unlike larger plantations with more resources and expertise. In the Philippines, cacao farmers have limited access to data, information, and good agricultural practices. This study addresses these issues by developing a mobile application for cacao disease identification and management that functions offline, enabling use in remote areas where farms are mostly located. The core of the system is a deep learning model trained to identify cacao diseases accurately. The trained model is integrated into the mobile app to support farmers in field diagnosis. The disease identification model achieved a validation accuracy of 96.93% while the model for detecting cacao black pod infection levels achieved 79.49% validation accuracy. Field testing of the application showed an agreement rate of 84.2% compared with expert cacao technician assessments. This approach empowers smallholder farmers by providing accessible, technology-enabled tools to improve cacao crop health and productivity.

279. 【2602.00214】A Geometric Multimodal Foundation Model Integrating Bp-MRI and Clinical Reports in Prostate Cancer Classification

链接https://arxiv.org/abs/2602.00214

作者:Juan A. Olmos,Antoine Manzanera,Fabio Martínez

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Prostate cancer, men worldwide, common cancers, Prostate, cancers in men

备注: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Prostate cancer (PCa) is one of the most common cancers in men worldwide. Bi-parametric MRI (bp-MRI) and clinical variables are crucial for PCa identification and improving treatment decisions. However, this process is subjective to expert interpretations. Furthermore, most existing computer-aided diagnosis methods focus on imaging-based models, overlooking the clinical context and suffering from data scarcity, limiting their ability to learn robust representations. We propose a geometric multimodal Foundation Model (FM), named MFM-Geom, that learns representations from bp-MRI and clinical reports, encoding visual findings and information from the context of clinical variables. In the representations classification head, the approach leverages symmetric positive definite (SPD) matrices and Riemannian deep learning to integrate imaging-text representations from a biomedical multimodal FM. Using 10% of the training data, MFM-Geom outperformed baseline class token embedding-based classification (+8.3%, AUC-PR of 90.67). Generalization on external dataset confirmed the robustness of fine-tuning biomedical FM, achieving an AUC-PR of 90.6.

280. 【2602.00212】Deep Learning Based CNN Model for Automated Detection of Pneumonia from Chest XRay Images

链接https://arxiv.org/abs/2602.00212

作者:Sathish Krishna Anumula,Vetrivelan Tamilmani,Aniruddha Arjun Singh,Dinesh Rajendran,Venkata Deepak Namburi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural Network CNN, Limited Adaptive Histogram, Adaptive Histogram Equalization, Histogram Equalization CLAHE, Convolutional Neural Network

备注: 17 Pages, 2 Tables, 6 Figures

点击查看摘要

Abstract:Pneumonia has been one of the major causes of morbidities and mortality in the world and the prevalence of this disease is disproportionately high among the pediatric and elderly populations especially in resources trained areas Fast and precise diagnosis is a prerequisite for successful clinical intervention but due to inter observer variation fatigue among experts and a shortage of qualified radiologists traditional approaches that rely on manual interpretation of chest radiographs are frequently constrained To address these problems this paper introduces a unified automated diagnostic model using a custom Convolutional Neural Network CNN that can recognize pneumonia in chest Xray images with high precision and at minimal computational expense In contrast like other generic transfer learning based models which often possess redundant parameters the offered architecture uses a tailor made depth wise separable convolutional design which is optimized towards textural characteristics of grayscale medical images Contrast Limited Adaptive Histogram Equalization CLAHE and geometric augmentation are two significant preprocessing techniques used to ensure that the system does not experience class imbalance and is more likely to generalize The system is tested using a dataset of 5863 anterior posterior chest Xrays.

281. 【2602.00211】Interpretable Unsupervised Deformable Image Registration via Confidence-bound Multi-Hop Visual Reasoning

链接https://arxiv.org/abs/2602.00211

作者:Zafar Iqbal,Anwar Ul Haq,Srimannarayana Grandhi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:requires aligning complex, aligning complex anatomical, complex anatomical structures, registration requires aligning, reference labels

备注

点击查看摘要

Abstract:Unsupervised deformable image registration requires aligning complex anatomical structures without reference labels, making interpretability and reliability critical. Existing deep learning methods achieve considerable accuracy but often lack transparency, leading to error drift and reduced clinical trust. We propose a novel Multi-Hop Visual Chain of Reasoning (VCoR) framework that reformulates registration as a progressive reasoning process. Inspired by the iterative nature of clinical decision-making, each visual reasoning hop integrates a Localized Spatial Refinement (LSR) module to enrich feature representations and a Cross-Reference Attention (CRA) mechanism that leads the iterative refinement process, preserving anatomical consistency. This multi-hop strategy enables robust handling of large deformations and produces a transparent sequence of intermediate predictions with a theoretical bound. Beyond accuracy, our framework offers built-in interpretability by estimating uncertainty via the stability and convergence of deformation fields across hops. Extensive evaluations on two challenging public datasets, DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain), demonstrate that VCoR achieves competitive registration accuracy while offering rich intermediate visualizations and confidence measures. By embedding an implicit visual reasoning paradigm, we present an interpretable, reliable, and clinically viable unsupervised medical image registration.

282. 【2602.00205】Reducing Class-Wise Performance Disparity via Margin Regularization

链接https://arxiv.org/abs/2602.00205

作者:Beier Zhu,Kesen Zhao,Jiequan Cui,Qianru Sun,Yuan Zhou,Xun Yang,Hanwang Zhang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep neural networks, Deep neural, exhibit substantial disparities, class-balanced data, posing concerns

备注: To appear in ICLR 2026

点击查看摘要

Abstract:Deep neural networks often exhibit substantial disparities in class-wise accuracy, even when trained on class-balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for Performance Disparity Reduction (MR$^2$), a theoretically principled regularization for classification by dynamically adjusting margins in both the logit and representation spaces. Our analysis establishes a margin-based, class-sensitive generalization bound that reveals how per-class feature variability contributes to error, motivating the use of larger margins for hard classes. Guided by this insight, MR$^2$ optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra-class compactness. Experiments on seven datasets, including ImageNet, and diverse pre-trained backbones (MAE, MoCov2, CLIP) demonstrate that MR$^2$ not only improves overall accuracy but also significantly boosts hard class performance without trading off easy classes, thus reducing performance disparity. Code is available at: this https URL

283. 【2602.00202】Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images

链接https://arxiv.org/abs/2602.00202

作者:Shanwen Wang,Xin Sun,Danfeng Hong,Fei Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:semi-supervised semantic segmentation, learn rich visual, rich visual knowledge, low-cost unlabeled images, semantic segmentation

备注

点击查看摘要

Abstract:The semi-supervised semantic segmentation (S4) can learn rich visual knowledge from low-cost unlabeled images. However, traditional S4 architectures all face the challenge of low-quality pseudo-labels, especially for the teacher-student this http URL propose a novel SemiEarth model that introduces vision-language models (VLMs) to address the S4 issues for the remote sensing (RS) domain. Specifically, we invent a VLM pseudo-label purifying (VLM-PP) structure to purify the teacher network's pseudo-labels, achieving substantial improvements. Especially in multi-class boundary regions of RS images, the VLM-PP module can significantly improve the quality of pseudo-labels generated by the teacher, thereby correctly guiding the student model's learning. Moreover, since VLM-PP equips VLMs with open-world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low-confidence pseudo-labels whenever a discrepancy arises between its prediction and the pseudo-label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at this https URL.

284. 【2602.00192】AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

链接https://arxiv.org/abs/2602.00192

作者:Elif Nebioglu,Emirhan Bilgiç,Adrian Popescu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Modern deep learning-based, raising critical challenges, enables realistic local, Modern deep, local image manipulation

备注: 21 pages, 15 figures, 6 tables

点击查看摘要

Abstract:Modern deep learning-based inpainting enables realistic local image manipulation, raising critical challenges for reliable detection. However, we observe that current detectors primarily rely on global artifacts that appear as inpainting side effects, rather than on locally synthesized content. We show that this behavior occurs because VAE-based reconstruction induces a subtle but pervasive spectral shift across the entire image, including unedited regions. To isolate this effect, we introduce Inpainting Exchange (INP-X), an operation that restores original pixels outside the edited region while preserving all synthesized content. We create a 90K test dataset including real, inpainted, and exchanged images to evaluate this phenomenon. Under this intervention, pretrained state-of-the-art detectors, including commercial ones, exhibit a dramatic drop in accuracy (e.g., from 91\% to 55\%), frequently approaching chance level. We provide a theoretical analysis linking this behavior to high-frequency attenuation caused by VAE information bottlenecks. Our findings highlight the need for content-aware detection. Indeed, training on our dataset yields better generalization and localization than standard inpainting. Our dataset and code are publicly available at this https URL.

285. 【2602.00191】GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models

链接https://arxiv.org/abs/2602.00191

作者:Yadang Alexis Rouzoumka,Jean Pinsolle,Eugénie Terreaux,Christèle Morisseau,Jean-Philippe Ovarlez,Chengfang Ren

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion models learn, Diffusion models, time-indexed score field, inherits approximate equivariances, circular shifts

备注: preprint

点击查看摘要

Abstract:Diffusion models learn a time-indexed score field $\mathbf{s}_\theta(\mathbf{x}_t,t)$ that often inherits approximate equivariances (flips, rotations, circular shifts) from in-distribution (ID) data and convolutional backbones. Most diffusion-based out-of-distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group-Equivariant Posterior Consistency (GEPC), a training-free probe that measures how consistently the learned score transforms under a finite group $\mathcal{G}$, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual, which averages an equivariance-residual functional over $\mathcal{G}$, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance-breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while remaining computationally lightweight. On high-resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target-background separation and visually interpretable equivariance-breaking maps. Code is available at this https URL.

286. 【2602.00190】From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models

链接https://arxiv.org/abs/2602.00190

作者:Mohit Jiwatode,Alexander Dockhorn,Bodo Rosenhahn

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:achieve high performance, complex game domains, Large Language Models, causal game mechanics, General Video Game

备注: Submitted to ICPR 2026

点击查看摘要

Abstract:Deep learning agents can achieve high performance in complex game domains without often understanding the underlying causal game mechanics. To address this, we investigate Causal Induction: the ability to infer governing laws from observational data, by tasking Large Language Models (LLMs) with reverse-engineering Video Game Description Language (VGDL) rules from gameplay traces. To reduce redundancy, we select nine representative games from the General Video Game AI (GVGAI) framework using semantic embeddings and clustering. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Both approaches are evaluated across multiple prompting strategies and controlled context regimes, varying the amount and form of information provided to the model, from just raw gameplay observations to partial VGDL specifications. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation, achieving preference win rates of up to 81\% in blind evaluations and yielding fewer logically inconsistent rules. These learned SCMs can be used for downstream use cases such as causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.

287. 【2602.00183】RPP: A Certified Poisoned-Sample Detection Framework for Backdoor Attacks under Dataset Imbalance

链接https://arxiv.org/abs/2602.00183

作者:Miao Lin,Feng Yu,Rui Ning,Lusi Li,Jiawei Chen,Qian Lou,Mengxin Zheng,Chunsheng Xin,Hongyi Wu

类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Deep neural networks, Deep neural, pervasive class imbalance, amplify backdoor threats, Randomized Probability Perturbation

备注

点击查看摘要

Abstract:Deep neural networks are highly susceptible to backdoor attacks, yet most defense methods to date rely on balanced data, overlooking the pervasive class imbalance in real-world scenarios that can amplify backdoor threats. This paper presents the first in-depth investigation of how the dataset imbalance amplifies backdoor vulnerability, showing that (i) the imbalance induces a majority-class bias that increases susceptibility and (ii) conventional defenses degrade significantly as the imbalance grows. To address this, we propose Randomized Probability Perturbation (RPP), a certified poisoned-sample detection framework that operates in a black-box setting using only model output probabilities. For any inspected sample, RPP determines whether the input has been backdoor-manipulated, while offering provable within-domain detectability guarantees and a probabilistic upper bound on the false positive rate. Extensive experiments on five benchmarks (MNIST, SVHN, CIFAR-10, TinyImageNet and ImageNet10) covering 10 backdoor attacks and 12 baseline defenses show that RPP achieves significantly higher detection accuracy than state-of-the-art defenses, particularly under dataset imbalance. RPP establishes a theoretical and practical foundation for defending against backdoor attacks in real-world environments with imbalanced data.

288. 【2602.00181】CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

链接https://arxiv.org/abs/2602.00181

作者:Hang Wu,Yujun Cai,Zehao Li,Haonan Ge,Bowen Sun,Junsong Yuan,Yiwei Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:video spatial intelligence, Understanding camera dynamics, spatial intelligence, fundamental pillar, pillar of video

备注

点击查看摘要

Abstract:Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the Observation-Think-Answer (O-T-A) reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.

289. 【2602.00176】Stabilizing Diffusion Posterior Sampling by Noise--Frequency Continuation

链接https://arxiv.org/abs/2602.00176

作者:Feng Tian,Yixuan Li,Weili Zeng,Weitian Zhang,Yichao Yan,Xiaokang Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:sampling solves inverse, solves inverse problems, pretrained diffusion prior, posterior sampling solves, recover fine details

备注

点击查看摘要

Abstract:Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance, but it often fails to recover fine details because measurement terms are applied in a manner that is weakly coupled to the diffusion noise level. At high noise, data-consistency gradients computed from inaccurate estimates can be geometrically incongruent with the posterior geometry, inducing early-step drift, spurious high-frequency artifacts, plus sensitivity to schedules and ill-conditioned operators. To address these concerns, we propose a noise--frequency Continuation framework that constructs a continuous family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band. This principle is instantiated with a stabilized posterior sampler that combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy that aggressively commits reliable coarse corrections while conservatively adopting high-frequency details only when they become identifiable. Across super-resolution, inpainting, and deblurring, our method achieves state-of-the-art performance and improves motion deblurring PSNR by up to 5 dB over strong baselines.

290. 【2602.00175】he Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization

链接https://arxiv.org/abs/2602.00175

作者:Manyi Li,Yufan Liu,Lai Jiang,Bing Li,Yuming Li,Weiming Hu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词:unlearning-based defenses claim, claim to purge, concepts from diffusion, largely an illusion, Latent Variable Optimization

备注: 21 pages, 22 figures, 17 tables

点击查看摘要

Abstract:Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveals that this "forgetting" is largely an illusion. Unlearning partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting the strength of unlearning. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion}, Adversarial Optimization and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at this http URL. Warning: This paper has unsafe images that may offend some readers.

291. 【2602.00174】Intra-Class Subdivision for Pixel Contrastive Learning: Application to Semi-supervised Cardiac Image Segmentation

链接https://arxiv.org/abs/2602.00174

作者:Jiajun Zhao,Xuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:address representation contamination, pixel contrastive learning, intra-class subdivision pixel, subdivision pixel contrastive, contrastive learning

备注: 5 pages, 7 figures, accepted by ICASSP 2026

点击查看摘要

Abstract:We propose an intra-class subdivision pixel contrastive learning (SPCL) framework for cardiac image segmentation to address representation contamination at boundaries. The novel concept ``Unconcerned sample'' is proposed to distinguish pixel representations at the inner and boundary regions within the same class, facilitating a clearer characterization of intra-class variations. A novel boundary contrastive loss for boundary representations is proposed to enhance representation discrimination across boundaries. The advantages of the unconcerned sample and boundary contrastive loss are analyzed theoretically. Experimental results in public cardiac datasets demonstrate that SPCL significantly improves segmentation performance, outperforming existing methods with respect to segmentation quality and boundary precision. Our code is available at this https URL.

292. 【2602.00168】YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation

链接https://arxiv.org/abs/2602.00168

作者:Ranjan Sapkota,Manoj Karkee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:paradigm of YOLOE, open-vocabulary learning paradigm, paper presents, integrates the deployment-optimized, YOLOE for real-time

备注

点击查看摘要

Abstract:This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26(or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.

293. 【2602.00163】Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders

链接https://arxiv.org/abs/2602.00163

作者:Laura Cif,Diane Demailly,Gabriella A. Horvàth,Juan Dario Ortigoza Escobar,Nathalie Dorison,Mayté Castro Jiménez,Cécile A. Hubsch,Thomas Wirth,Gun-Marie Hariz,Sophie Huby,Morgan Dornadic,Zohra Souei,Muhammad Mushhood Ur Rehman,Simone Hemm,Mehdi Boulayme,Eduardo M. Moraud,Jocelyne Bloch,Xavier Vasques

类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

关键词:Hyperkinetic movement disorders, disabling motor manifestations, Hyperkinetic movement, movement disorders, childhood and adulthood

备注

点击查看摘要

Abstract:Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.

294. 【2602.00153】See Without Decoding: Motion-Vector-Based Tracking in Compressed Video

链接https://arxiv.org/abs/2602.00153

作者:Axel Duché,Clément Chatelain,Gilles Gasso

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:RGB video decoding, lightweight compressed-domain tracking, requiring full RGB, full RGB video, compressed-domain tracking model

备注

点击查看摘要

Abstract:We propose a lightweight compressed-domain tracking model that operates directly on video streams, without requiring full RGB video decoding. Using motion vectors and transform coefficients from compressed data, our deep model propagates object bounding boxes across frames, achieving a computational speed-up of order up to 3.7 with only a slight 4% mAP@0.5 drop vs RGB baseline on MOTS15/17/20 datasets. These results highlight codec-domain motion modeling efficiency for real-time analytics in large monitoring systems.

295. 【2602.00152】Real-Time Human Activity Recognition on Edge Microcontrollers: Dynamic Hierarchical Inference with Multi-Spectral Sensor Fusion

链接https://arxiv.org/abs/2602.00152

作者:Boyu Li,Kuangji Zuo,Lincong Li,Yonghui Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

关键词:existing approaches struggle, accurate on-device pattern, on-device pattern recognition, Pseudo-image Enhancement Fusion, Human Activity Recognition

备注: 24 pages, 6 figures. The manusrcipt is under review at Measurement

点击查看摘要

Abstract:The demand for accurate on-device pattern recognition in edge applications is intensifying, yet existing approaches struggle to reconcile accuracy with computational constraints. To address this challenge, a resource-aware hierarchical network based on multi-spectral fusion and interpretable modules, namely the Hierarchical Parallel Pseudo-image Enhancement Fusion Network (HPPI-Net), is proposed for real-time, on-device Human Activity Recognition (HAR). Deployed on an ARM Cortex-M4 microcontroller for low-power real-time inference, HPPI-Net achieves 96.70% accuracy while utilizing only 22.3 KiB of RAM and 439.5 KiB of ROM after optimization. HPPI-Net employs a two-layer architecture. The first layer extracts preliminary features using Fast Fourier Transform (FFT) spectrograms, while the second layer selectively activates either a dedicated module for stationary activity recognition or a parallel LSTM-MobileNet network (PLMN) for dynamic states. PLMN fuses FFT, Wavelet, and Gabor spectrograms through three parallel LSTM encoders and refines the concatenated features using Efficient Channel Attention (ECA) and Depthwise Separable Convolution (DSC), thereby offering channel-level interpretability while substantially reducing multiply-accumulate operations. Compared with MobileNetV3, HPPI-Net improves accuracy by 1.22% and reduces RAM usage by 71.2% and ROM usage by 42.1%. These results demonstrate that HPPI-Net achieves a favorable accuracy-efficiency trade-off and provides explainable predictions, establishing a practical solution for wearable, industrial, and smart home HAR on memory-constrained edge platforms.

296. 【2602.00151】Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency

链接https://arxiv.org/abs/2602.00151

作者:Alexander Blezinger,Wolfgang Nejdl,Ming Tang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:found great success, prediction remains underexplored, biomarker prediction remains, Foundation models pretrained, Foundation models

备注: 9 pages, 7 figures and 5 tables. Initialy submitted for IJCAI 2026

点击查看摘要

Abstract:Foundation models pretrained on large-scale histopathology data have found great success in various fields of computational pathology, but their impact on regressive biomarker prediction remains underexplored. In this work, we systematically evaluate histopathological foundation models for regression-based tasks, demonstrated through the prediction of homologous recombination deficiency (HRD) score - a critical biomarker for personalized cancer treatment. Within multiple instance learning frameworks, we extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models, and evaluate their impact compared to contrastive learning-based features. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts from two public medical data collections. Extensive experiments demonstrate that models trained on foundation model features consistently outperform the baseline in terms of predictive accuracy and generalization capabilities while exhibiting systematic differences among the foundation models. Additionally, we propose a distribution-based upsampling strategy to mitigate target imbalance in these datasets, significantly improving the recall and balanced accuracy for underrepresented but clinically important patient populations. Furthermore, we investigate the impact of different sampling strategies and instance bagsizes by ablation studies. Our results highlight the benefits of large-scale histopathological pretraining for more precise and transferable regressive biomarker prediction, showcasing its potential to advance AI-driven precision oncology.

297. 【2602.00149】SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles

链接https://arxiv.org/abs/2602.00149

作者:Shucong Li,Xiaoluo Zhou,Yuqian He,Zhenyu Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:Internet of Vehicles, part in Internet, important part, Simulated Densifying, Vehicles

备注

点击查看摘要

Abstract:3-D object detection based on 4-D radar-vision is an important part in Internet of Vehicles (IoV). However, there are two challenges which need to be faced. First, the 4-D radar point clouds are sparse, leading to poor 3-D representation. Second, vision datas exhibit representation degradation under low-light, long distance detection and dense occlusion scenes, which provides unreliable texture information during fusion stage. To address these issues, a framework named SDCM is proposed, which contains Simulated Densifying and Compensatory Modeling Fusion for radar-vision 3-D object detection in IoV. Firstly, considering point generation based on Gaussian simulation of key points obtained from 3-D Kernel Density Estimation (3-D KDE), and outline generation based on curvature simulation, Simulated Densifying (SimDen) module is designed to generate dense radar point clouds. Secondly, considering that radar data could provide more real time information than vision data, due to the all-weather property of 4-D radar. Radar Compensatory Mapping (RCM) module is designed to reduce the affects of vision datas' representation degradation. Thirdly, considering that feature tensor difference values contain the effective information of every modality, which could be extracted and modeled for heterogeneity reduction and modalities interaction, Mamba Modeling Interactive Fusion (MMIF) module is designed for reducing heterogeneous and achieving interactive Fusion. Experiment results on the VoD, TJ4DRadSet and Astyx HiRes 2019 dataset show that SDCM achieves best performance with lower parameter quantity and faster inference speed. Our code will be available.

298. 【2602.00148】Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields

链接https://arxiv.org/abs/2602.00148

作者:Shiqian Li,Ruihong Shen,Junfeng Ni,Chang Pan,Chi Zhang,Yixin Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:raw visual data, visual data remains, Predicting physical dynamics, Predicting physical, data remains

备注: 43 pages, ICLR 2026

点击查看摘要

Abstract:Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, achieving two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.

299. 【2602.00145】DensiThAI, A Multi-View Deep Learning Framework for Breast Density Estimation using Infrared Images

链接https://arxiv.org/abs/2602.00145

作者:Siva Teja Kakileti,Geetha Manjunath

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:affecting mammographic sensitivity, major factor affecting, factor affecting mammographic, breast cancer risk, mammographic sensitivity

备注

点击查看摘要

Abstract:Breast tissue density is a key biomarker of breast cancer risk and a major factor affecting mammographic sensitivity. However, density assessment currently relies almost exclusively on X-ray mammography, an ionizing imaging modality. This study investigates the feasibility of estimating breast density using artificial intelligence over infrared thermal images, offering a non-ionizing imaging approach. The underlying hypothesis is that fibroglandular and adipose tissues exhibit distinct thermophysical and physiological properties, leading to subtle but spatially coherent temperature variations on the breast surface. In this paper, we propose DensiThAI, a multi-view deep learning framework for breast density classification from thermal images. The framework was evaluated on a multi-center dataset of 3,500 women using mammography-derived density labels as reference. Using five standard thermal views, DensiThAI achieved a mean AUROC of 0.73 across 10 random splits, with statistically significant separation between density classes across all splits (p 0.05). Consistent performance across age cohorts supports the potential of thermal imaging as a non-ionizing approach for breast density assessment with implications for improved patient experience and workflow optimization.

300. 【2602.00144】Scalable Analytic Classifiers with Associative Drift Compensation for Class-Incremental Learning of Vision Transformers

链接https://arxiv.org/abs/2602.00144

作者:Xuan Rao,Mingming Ha,Bo Zhao,Derong Liu,Cesare Alippi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:stochastic gradient descent, major computational bottleneck, existing methods rely, costly iterative stochastic, iterative stochastic gradient

备注

点击查看摘要

Abstract:Class-incremental learning (CIL) with Vision Transformers (ViTs) faces a major computational bottleneck during the classifier reconstruction phase, where most existing methods rely on costly iterative stochastic gradient descent (SGD). We observe that analytic Regularized Gaussian Discriminant Analysis (RGDA) provides a Bayes-optimal alternative with accuracy comparable to SGD-based classifiers; however, its quadratic inference complexity limits its use in large-scale CIL scenarios. To overcome this, we propose Low-Rank Factorized RGDA (LR-RGDA), a scalable classifier that combines RGDA's expressivity with the efficiency of linear classifiers. By exploiting the low-rank structure of the covariance via the Woodbury matrix identity, LR-RGDA decomposes the discriminant function into a global affine term refined by a low-rank quadratic perturbation, reducing the inference complexity from $\mathcal{O}(Cd^2)$ to $\mathcal{O}(d^2 + Crd^2)$, where $C$ is the class number, $d$ the feature dimension, and $r \ll d$ the subspace rank. To mitigate representation drift caused by backbone updates, we further introduce Hopfield-based Distribution Compensator (HopDC), a training-free mechanism that uses modern continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error. Extensive experiments on diverse CIL benchmarks demonstrate that our framework achieves state-of-the-art performance, providing a scalable solution for large-scale class-incremental learning with ViTs. Code: this https URL.

301. 【2602.00135】LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models

链接https://arxiv.org/abs/2602.00135

作者:Pengcheng Zheng,Chaoning Zhang,Jiarong Mo,GuoHui Li,Jiaquan Zhang,Jiahao Zhang,Sihan Cao,Sheng Zheng,Caiyan Qin,Guoqing Wang,Yang Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved impressive performance, Large multimodal models, memory costs hinder, Large multimodal, vision-language tasks

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.

302. 【2602.00132】Shedding the Facades, Connecting the Domains: Detecting Shifting Multimodal Hate Video with Test-Time Adaptation

链接https://arxiv.org/abs/2602.00132

作者:Jiao Li,Jian Lang,Xikai Tang,Wenzheng Shu,Ting Zhong,Qiang Gao,Yong Wang,Leiting Chen,Fan Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Hate Video Detection, Hate Video, Video Detection, online ecosystems, crucial for online

备注: Accepted by AAAI2026 main track

点击查看摘要

Abstract:Hate Video Detection (HVD) is crucial for online ecosystems. Existing methods assume identical distributions between training (source) and inference (target) data. However, hateful content often evolves into irregular and ambiguous forms to evade censorship, resulting in substantial semantic drift and rendering previously trained models ineffective. Test-Time Adaptation (TTA) offers a solution by adapting models during inference to narrow the cross-domain gap, while conventional TTA methods target mild distribution shifts and struggle with the severe semantic drift in HVD. To tackle these challenges, we propose SCANNER, the first TTA framework tailored for HVD. Motivated by the insight that, despite the evolving nature of hateful manifestations, their underlying cores remain largely invariant (i.e., targeting is still based on characteristics like gender, race, etc), we leverage these stable cores as a bridge to connect the source and target domains. Specifically, SCANNER initially reveals the stable cores from the ambiguous layout in evolving hateful content via a principled centroid-guided alignment mechanism. To alleviate the impact of outlier-like samples that are weakly correlated with centroids during the alignment process, SCANNER enhances the prior by incorporating a sample-level adaptive centroid alignment strategy, promoting more stable adaptation. Furthermore, to mitigate semantic collapse from overly uniform outputs within clusters, SCANNER introduces an intra-cluster diversity regularization that encourages the cluster-wise semantic richness. Experiments show that SCANNER outperforms all baselines, with an average gain of 4.69% in Macro-F1 over the best.

303. 【2602.00131】PovNet+: A Deep Learning Architecture for Socially Assistive Robots to Learn and Assist with Multiple Activities of Daily Living

链接https://arxiv.org/abs/2602.00131

作者:Fraser Robinson,Souren Pashangpour,Matthew Lisondra,Goldie Nejat

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:autonomous socially assistive, socially assistive robots, ADL, socially assistive, significant barrier

备注: Submitted to Advanced Robotics (Taylor Francis)

点击查看摘要

Abstract:A significant barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living (ADLs). In this paper, we present the first multimodal deep learning architecture, POVNet+, for multi-activity recognition for socially assistive robots to proactively initiate assistive behaviors. Our novel architecture introduces the use of both ADL and motion embedding spaces to uniquely distinguish between a known ADL being performed, a new unseen ADL, or a known ADL being performed atypically in order to assist people in real scenarios. Furthermore, we apply a novel user state estimation method to the motion embedding space to recognize new ADLs while monitoring user performance. This ADL perception information is used to proactively initiate robot assistive interactions. Comparison experiments with state-of-the-art human activity recognition methods show our POVNet+ method has higher ADL classification accuracy. Human-robot interaction experiments in a cluttered living environment with multiple users and the socially assistive robot Leia using POVNet+ demonstrate the ability of our multi-modal ADL architecture in successfully identifying different seen and unseen ADLs, and ADLs being performed atypically, while initiating appropriate assistive human-robot interactions.

304. 【2602.00126】D3R-Net: Dual-Domain Denoising Reconstruction Network for Robust Industrial Anomaly Detection

链接https://arxiv.org/abs/2602.00126

作者:Dmytro Filatov,Valentyn Fedorov,Vira Filatova,Andrii Zelenchuk

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Unsupervised anomaly detection, automated visual inspection, Unsupervised anomaly, anomaly detection, modern manufacturing

备注: 9 pages

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) is a key ingredient of automated visual inspection in modern manufacturing. The reconstruction-based methods appeal because they have basic architectural design and they process data quickly but they produce oversmoothed results for high-frequency details. As a result, subtle defects are partially reconstructed rather than highlighted, which limits segmentation accuracy. We build on this line of work and introduce D3R-Net, a Dual-Domain Denoising Reconstruction framework that couples a self-supervised 'healing' task with frequency-aware regularization. During training, the network receives synthetically corrupted normal images and is asked to reconstruct the clean targets, which prevents trivial identity mapping and pushes the model to learn the manifold of defect-free textures. In addition to the spatial mean squared error, we employ a Fast Fourier Transform (FFT) magnitude loss that encourages consistency in the frequency domain. The implementation also allows an optional structural similarity (SSIM) term, which we study in an ablation. On the MVTec AD Hazelnut benchmark, D3R-Net with the FFT loss improves localization consistency over a spatial-only baseline: PRO AUC increases from 0.603 to 0.687, while image-level ROC AUC remains robust. Evaluated across fifteen MVTec categories, the FFT variant raises the average pixel ROC AUC from 0.733 to 0.751 and PRO AUC from 0.417 to 0.468 compared to the MSE-only baseline, at roughly 20 FPS on a single GPU. The network is trained from scratch and uses a lightweight convolutional autoencoder backbone, providing a practical alternative to heavy pre-trained feature embedding methods.

305. 【2602.00124】Context-Aware Autoencoders for Anomaly Detection in Maritime Surveillance

链接https://arxiv.org/abs/2602.00124

作者:Divya Acharya,Pierre Bernab'e,Antoine Chevrot,Helge Spieker,Arnaud Gotlieb,Bruno Legeard

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:crucial to ensuring, ensuring the safety, safety and security, self-reported AIS messages, anomalies

备注

点击查看摘要

Abstract:The detection of anomalies is crucial to ensuring the safety and security of maritime vessel traffic surveillance. Although autoencoders are popular for anomaly detection, their effectiveness in identifying collective and contextual anomalies is limited, especially in the maritime domain, where anomalies depend on vessel-specific contexts derived from self-reported AIS messages. To address these limitations, we propose a novel solution: the context-aware autoencoder. By integrating context-specific thresholds, our method improves detection accuracy and reduces computational cost. We compare four context-aware autoencoder variants and a conventional autoencoder using a case study focused on fishing status anomalies in maritime surveillance. Results demonstrate the significant impact of context on reconstruction loss and anomaly detection. The context-aware autoencoder outperforms others in detecting anomalies in time series data. By incorporating context-specific thresholds and recognizing the importance of context in anomaly detection, our approach offers a promising solution to improve accuracy in maritime vessel traffic surveillance systems.

306. 【2602.00123】Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models

链接https://arxiv.org/abs/2602.00123

作者:Filip Nowicki,Hubert Marciniak,Jakub Łączkowski,Krzysztof Jassem,Tomasz Górecki,Vimala Balakrishnan,Desmond C. Ong,Maciej Behnke

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:Affective Picture System, International Affective Picture, Nencki Affective Picture, Picture System, Affective Picture

备注

点击查看摘要

Abstract:Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale; it is not yet clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psycho-metrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images. The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 1-7/9 Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% on six-emotion labels and from 60% to 75% on a more challenging 12-category task. The predictions of anger and surprise had the lowest accuracy in all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r 0.75) but also exhibited consistent biases, notably weaker performance on arousal, and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting their potential and current limitations for affective computing and mental health-related applications.

307. 【2602.00122】VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

链接https://arxiv.org/abs/2602.00122

作者:Hongzhu Yi,Yujia Yang,Yuanxiang Wang,Zhenyu Guan,Jiahuan Chen,Chenxi Bao,Tiankun Yang,Yixuan Yuan,Tianyu Zong,Xinming Wang,Tao Yu,Ruiwen Tao,Haijin Liang,Jin Ma,Jinwen Luo,Yeshani Xinyu Zuo,Jungang Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:achieved substantial progress, image editing models, manipulate visual content, multimodal image editing, image editing

备注

点击查看摘要

Abstract:In recent years, multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, an important yet insufficiently explored research direction remains visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts, thereby failing to adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose \textbf{V}isual \textbf{D}oc \textbf{E}dit Bench(VDE Bench), a rigorously human-annotated and evaluated benchmark specifically designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset encompassing densely textual documents in both English and Chinese, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification demonstrates a strong consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents.

308. 【2602.00117】IC-EO: Interpretable Code-based assistant for Earth Observation

链接https://arxiv.org/abs/2602.00117

作者:Lamia Lahouel,Laurynas Lopata,Simon Gruening,Gabriele Meoni,Gaetan Petit,Sylvain Lobry

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Earth Observation, requiring expert knowledge, analysis remains difficult, computer vision, requiring expert

备注: 15 pages, 1 figure

点击查看摘要

Abstract:Despite recent advances in computer vision, Earth Observation (EO) analysis remains difficult to perform for the laymen, requiring expert knowledge and technical capabilities. Furthermore, many systems return black-box predictions that are difficult to audit or reproduce. Leveraging recent advances in tool LLMs, this study proposes a conversational, code-generating agent that transforms natural-language queries into executable, auditable Python workflows. The agent operates over a unified easily extendable API for classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. With our proposed framework, it is possible to control the results at three levels: (i) tool-level performance on public EO benchmarks; (ii) at the agent-level to understand the capacity to generate valid, hallucination-free code; and (iii) at the task-level on specific use cases. In this work, we select two use-cases of interest: land-composition mapping and post-wildfire damage assessment. The proposed agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition and 50% vs. 0% on post-wildfire analysis, while producing results that are transparent and easy to interpret. By outputting verifiable code, the approach turns EO analysis into a transparent, reproducible process.

309. 【2602.00115】Event Driven Clustering Algorithm

链接https://arxiv.org/abs/2602.00115

作者:David El-Chai Ben-Ezra,Adar Tal,Daniel Brisk

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:small event clusters, paper introduces, real-time detection, detection of small, event clusters based

备注: ~10 pages, 2 figures

点击查看摘要

Abstract:This paper introduces a novel asynchronous, event-driven algorithm for real-time detection of small event clusters in event camera data. Like other hierarchical agglomerative clustering algorithms, the algorithm detects the event clusters based on their tempo-spatial distance. However, the algorithm leverages the special asynchronous data structure of event camera, and by a sophisticated, efficient and simple decision-making, enjoys a linear complexity of $O(n)$ where $n$ is the events amount. In addition, the run-time of the algorithm is independent with the dimensions of the pixels array.

310. 【2602.00114】1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization

链接https://arxiv.org/abs/2602.00114

作者:Yunwei Bai,Ying Kiat Tan,Yao Shu,Tsuhan Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Few-shot learning, test-time augmentations fail, challenges model generalization, traditional test-time augmentations, classes based

备注

点击查看摘要

Abstract:Few-shot learning (FSL) challenges model generalization to novel classes based on just a few shots of labeled examples, a testbed where traditional test-time augmentations fail to be effective. We introduce 1S-DAug, a one-shot generative augmentation operator that synthesizes diverse yet faithful variants from just one example image at test time. 1S-DAug couples traditional geometric perturbations with controlled noise injection and a denoising diffusion process conditioned on the original image. The generated images are then encoded and aggregated, alongside the original image, into a combined representation for more robust FSL predictions. Integrated as a training-free model-agnostic plugin, 1S-DAug consistently improves FSL across standard benchmarks of 4 different datasets without any model parameter update, including achieving over 10% proportional accuracy improvement on the miniImagenet 5-way-1-shot benchmark. Codes will be released.

311. 【2602.00113】AI-Driven Three-Dimensional Reconstruction and Quantitative Analysis for Burn Injury Assessment

链接https://arxiv.org/abs/2602.00113

作者:S. Kalaycioglu,C. Hong,K. Zhai,H. Xie,J.N. Wong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:conventional visual inspection, reproducible burn assessment, burn assessment, medico-legal documentation, photography are subjective

备注: 11 pages and 5 figures

点击查看摘要

Abstract:Accurate, reproducible burn assessment is critical for treatment planning, healing monitoring, and medico-legal documentation, yet conventional visual inspection and 2D photography are subjective and limited for longitudinal comparison. This paper presents an AI-enabled burn assessment and management platform that integrates multi-view photogrammetry, 3D surface reconstruction, and deep learning-based segmentation within a structured clinical workflow. Using standard multi-angle images from consumer-grade cameras, the system reconstructs patient-specific 3D burn surfaces and maps burn regions onto anatomy to compute objective metrics in real-world units, including surface area, TBSA, depth-related geometric proxies, and volumetric change. Successive reconstructions are spatially aligned to quantify healing progression over time, enabling objective tracking of wound contraction and depth reduction. The platform also supports structured patient intake, guided image capture, 3D analysis and visualization, treatment recommendations, and automated report generation. Simulation-based evaluation demonstrates stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting a scalable, non-invasive approach to objective, geometry-aware burn assessment and decision support in acute and outpatient care.

312. 【2602.00111】From Manual Observation to Automated Monitoring: Space Allowance Effects on Play Behaviour in Group-Housed Dairy Calves

链接https://arxiv.org/abs/2602.00111

作者:Haiyu Yang,Heidi Lesscher,Enhong Liu,Miel Hostens

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:remains poorly characterized, conditions remains poorly, commercial conditions remains, Play behaviour serves, positive welfare indicator

备注

点击查看摘要

Abstract:Play behaviour serves as a positive welfare indicator in dairy calves, yet the influence of space allowance under commercial conditions remains poorly characterized, particularly at intermediate-to-high allowances (6-20 m2 per calf). This study investigated the relationship between space allowance and play behaviour in 60 group-housed dairy calves across 14 commercial farms in the Netherlands (space range: 2.66-17.98 m2 per calf), and developed an automated computer vision pipeline for scalable monitoring. Video observations were analyzed using a detailed ethogram, with play expressed as percentage of observation period (%OP). Statistical analysis employed linear mixed models with farm as a random effect. A computer vision pipeline was trained on manual annotations from 108 hours on 6 farms and validated on held-out test data. The computer vision classifier achieved 97.6% accuracy with 99.4% recall for active play detection. Calves spent on average 1.0% of OP playing reflecting around 10 minutes per 17-hour period. The space-play relationship was non-linear, with highest play levels at 8-10 m2 per calf (1.6% OP) and lowest at 6-8 m2 and 12-14 m2 (0.6% OP). Space remained significant after controlling for age, health, and group size. In summary, these findings suggest that 8-10 m2 per calf represents a practical target balancing welfare benefits with economic feasibility, and demonstrate that automated monitoring can scale small annotation projects to continuous welfare assessment systems.

313. 【2602.00110】Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer

链接https://arxiv.org/abs/2602.00110

作者:Yu Li,Guilherme N. DeSouza,Praveen Rao,Chi-Ren Shyu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:driven major progress, detection and segmentation, transformers have driven, driven major, major progress

备注: Submitted to IEEE Transactions on Geoscience and Remote Sensing

点击查看摘要

Abstract:Visual transformers have driven major progress in remote sensing image analysis, particularly in object detection and segmentation. Recent vision-language and multimodal models further extend these capabilities by incorporating auxiliary information, including captions, question and answer pairs, and metadata, which broadens applications beyond conventional computer vision tasks. However, these models are typically optimized for semantic alignment between visual and textual content rather than geospatial understanding, and therefore are not suited for representing or reasoning with structured geospatial layers. In this study, we propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information. Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches. To facilitate cross-modal interaction, we design a guided attention module that dynamically integrates multimodal information by computing attention weights based on correlations with auxiliary data, thereby directing the model toward the most relevant regions. In addition, the module assigns distinct roles to individual attention heads, allowing the model to capture complementary aspects of the guidance information and improving the interpretability of its predictions. Experimental results demonstrate that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence, highlighting its effectiveness in multimodal geospatial understanding.

314. 【2602.00109】Robustness of Presentation Attack Detection in Remote Identity Validation Scenarios

链接https://arxiv.org/abs/2602.00109

作者:John J. Howard(SAIC Identity and Data Sciences Laboratory),Richard O. Plesh(SAIC Identity and Data Sciences Laboratory),Yevgeniy B. Sirotin(SAIC Identity and Data Sciences Laboratory),Jerry L. Tipton(SAIC Identity and Data Sciences Laboratory),Arun R. Vemury(U.S. Department of Homeland Security, Science and Technology Directorate)

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:remote identity validation, user-friendly remote identity, Presentation attack detection, attack detection, identity validation

备注: Accepted to the IEEE/CVF WACV 2026 Workshop on Generative, Adversarial and Presentation Attacks in Biometrics (GAPBio). 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Presentation attack detection (PAD) subsystems are an important part of effective and user-friendly remote identity validation (RIV) systems. However, ensuring robust performance across diverse environmental and procedural conditions remains a critical challenge. This paper investigates the impact of low-light conditions and automated image acquisition on the robustness of commercial PAD systems using a scenario test of RIV. Our results show that PAD systems experience a significant decline in performance when utilized in low-light or auto-capture scenarios, with a model-predicted increase in error rates by a factor of about four under low-light conditions and a doubling of those odds under auto-capture workflows. Specifically, only one of the tested systems was robust to these perturbations, maintaining a maximum bona fide presentation classification error rate below 3% across all scenarios. Our findings emphasize the importance of testing across diverse environments to ensure robust and reliable PAD performance in real-world applications.

315. 【2602.00108】SITUATE -- Synthetic Object Counting Dataset for VLM training

链接https://arxiv.org/abs/2602.00108

作者:René Peinl,Vincent Tischler,Patrick Schröder,Christian Groth

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:evaluating Vision Language, Vision Language Models, Vision Language, evaluating Vision, Language Models

备注: accepted at 21st International Conference on Computer Vision Theory and Applications

点击查看摘要

Abstract:We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization for out-of-distribution images, since a finetune of Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa. We cross validate this by comparing the model performance across established other counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.

316. 【2602.00107】Efficient UAV trajectory prediction: A multi-modal deep diffusion framework

链接https://arxiv.org/abs/2602.00107

作者:Yuan Gao,Xinyu Guo,Wenjing Xie,Zifan Wang,Hongwen Yu,Gongyang Li,Shugong Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:UAV trajectory prediction, Deep Fusion Framework, multi-modal UAV trajectory, prediction method based, UAV trajectory

备注: in Chinese language

点击查看摘要

Abstract:To meet the requirements for managing unauthorized UAVs in the low-altitude economy, a multi-modal UAV trajectory prediction method based on the fusion of LiDAR and millimeter-wave radar information is proposed. A deep fusion network for multi-modal UAV trajectory prediction, termed the Multi-Modal Deep Fusion Framework, is designed. The overall architecture consists of two modality-specific feature extraction networks and a bidirectional cross-attention fusion module, aiming to fully exploit the complementary information of LiDAR and radar point clouds in spatial geometric structure and dynamic reflection characteristics. In the feature extraction stage, the model employs independent but structurally identical feature encoders for LiDAR and radar. After feature extraction, the model enters the Bidirectional Cross-Attention Mechanism stage to achieve information complementarity and semantic alignment between the two modalities. To verify the effectiveness of the proposed model, the MMAUD dataset used in the CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge is adopted as the training and testing dataset. Experimental results show that the proposed multi-modal fusion model significantly improves trajectory prediction accuracy, achieving a 40% improvement compared to the baseline model. In addition, ablation experiments are conducted to demonstrate the effectiveness of different loss functions and post-processing strategies in improving model performance. The proposed model can effectively utilize multi-modal data and provides an efficient solution for unauthorized UAV trajectory prediction in the low-altitude economy.

317. 【2602.00105】HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models

链接https://arxiv.org/abs/2602.00105

作者:Wing Chan,Richard Allen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:typically best-case samples, real workflows pay, image editing models, best-case samples, real workflows

备注: 14 pages, 5 figures, for code and data, see [this https URL](https://github.com/sourceful-official/hype-edit-1-benchmark)

点击查看摘要

Abstract:Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models that have low per-image pricing are more expensive when you consider the total effective cost of retries and human reviews.

318. 【2602.00104】R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation

链接https://arxiv.org/abs/2602.00104

作者:Zhuohong Chen,Zhengxian Wu,Zirui Liao,Shenao Jiang,Hangrui Xu,Yang Chen,Chaokui Su,Xiaoyu Liu,Haoqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:VQA requires retrieving, supply missing visual, requires retrieving images, VQA requires, missing visual cues

备注

点击查看摘要

Abstract:Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains this http URL address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking this http URL first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence this http URL MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at this https URL.

319. 【2602.00096】Mirage2Matter: A Physically Grounded Gaussian World Model from Video

链接https://arxiv.org/abs/2602.00096

作者:Zhengqing Gao,Ziwen Li,Xin Wang,Jiaxin Huang,Zhenyang Ren,Mingkai Shao,Hanlue Zhang,Tianyu Huang,Yongkang Cheng,Yandong Guo,Runqi Lin,Yuanyuan Wang,Tongliang Liu,Kun Zhang,Mingming Gong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:fundamentally constrained, real-world interaction data, interaction data, real-world interaction, Gaussian Splatting

备注

点击查看摘要

Abstract:The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.

320. 【2602.00095】EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

链接https://arxiv.org/abs/2602.00095

作者:Weiyu Sun,Liangliang Chen,Yongnuo Cai,Huiru Xie,Yi Zeng,Ying Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. In solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and rectify recognition errors, with only minimal human intervention (approximately 4% of the total solutions), can significantly enhance the robustness of the deployed AI-enabled grading system on unseen student solutions.

321. 【2602.00092】Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

链接https://arxiv.org/abs/2602.00092

作者:Neha Kalibhat,Zi Wang,Prasoon Bajpai,Drew Proud,Wenjun Zeng,Been Kim,Mani Malek

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:natural language summary, black-box interpretability framework, introduce a black-box, black-box interpretability, natural language

备注

点击查看摘要

Abstract:We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model's specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of 1.86 times boost in success rate over methods that do not use constitutions.

322. 【2602.00079】Lossless Embedding Compression via Spherical Coordinates

链接https://arxiv.org/abs/2602.00079

作者:Han Xiao

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:prior lossless method, prior lossless, causing IEEE, Abstract, times

备注

点击查看摘要

Abstract:We present a compression method for unit-norm embeddings that achieves 1.5$\times$ compression, 25% better than the best prior lossless method. The method exploits that spherical coordinates of high-dimensional unit vectors concentrate around $\pi/2$, causing IEEE 754 exponents to collapse to a single value and high-order mantissa bits to become predictable, enabling entropy coding of both. Reconstruction error is below 1e-7, under float32 machine epsilon. Evaluation across 26 configurations spanning text, image, and multi-vector embeddings confirms consistent improvement.

323. 【2602.00032】Happy Young Women, Grumpy Old Men? Emotion-Driven Demographic Biases in Synthetic Face Generation

链接https://arxiv.org/abs/2602.00032

作者:Mengting Wei,Aditya Gulati,Guoying Zhao,Nuria Oliver

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling high-fidelity image, multimodal large language, high-fidelity image production, large language models, enabling high-fidelity

备注: 23 pages, 11 figures

点击查看摘要

Abstract:Synthetic face generation has rapidly advanced with the emergence of text-to-image (T2I) and of multimodal large language models, enabling high-fidelity image production from natural-language prompts. Despite the widespread adoption of these tools, the biases, representational quality, and cross-cultural consistency of these models remain poorly understood. Prior research on biases in the synthetic generation of human faces has examined demographic biases, yet there is little research on how emotional prompts influence demographic representation and how models trained in different cultural and linguistic contexts vary in their output distributions. We present a systematic audit of eight state-of-the-art T2I models comprising four models developed by Western organizations and four developed by Chinese institutions, all prompted identically. Using state-of-the-art facial analysis algorithms, we estimate the gender, race, age, and attractiveness levels in the generated faces. To measure the deviations from global population statistics, we apply information-theoretic bias metrics including Kullback-Leibler and Jensen-Shannon divergences. Our findings reveal persistent demographic and emotion-conditioned biases in all models regardless of their country of origin. We discuss implications for fairness, socio-technical harms, governance, and the development of transparent generative systems.

324. 【2501.17634】Federated Learning With Individualized Privacy Through Client Sampling

链接https://arxiv.org/abs/2501.17634

作者:Lucas Lange,Ole Borchardt,Erhard Rahm

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:diverse user privacy, user data collection, growing concerns, promising solution, solution to balance

备注: Accepted at 10th International Conference on Machine Learning Technologies (ICMLT 2025)

点击查看摘要

Abstract:With growing concerns about user data collection, individualized privacy has emerged as a promising solution to balance protection and utility by accounting for diverse user privacy preferences. Instead of enforcing a uniform level of anonymization for all users, this approach allows individuals to choose privacy settings that align with their comfort levels. Building on this idea, we propose an adapted method for enabling Individualized Differential Privacy (IDP) in Federated Learning (FL) by handling clients according to their personal privacy preferences. By extending the SAMPLE algorithm from centralized settings to FL, we calculate client-specific sampling rates based on their heterogeneous privacy budgets and integrate them into a modified IDP-FedAvg algorithm. We test this method under realistic privacy distributions and multiple datasets. The experimental results demonstrate that our approach achieves clear improvements over uniform DP baselines, reducing the trade-off between privacy and utility. Compared to the alternative SCALE method in related work, which assigns differing noise scales to clients, our method performs notably better. However, challenges remain for complex tasks with non-i.i.d. data, primarily stemming from the constraints of the decentralized setting.

325. 【2409.01329】Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine Learning

链接https://arxiv.org/abs/2409.01329

作者:Lucas Lange,Maurice-Maximilian Heykeroth,Erhard Rahm

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

关键词:including computer vision, Machine Learning, Privacy-Preserving Machine Learning, including computer, computer vision

备注: Accepted at 21st Conference on Database Systems for Business, Technology and Web (BTW 2025)

点击查看摘要

Abstract:Machine Learning (ML) is crucial in many sectors, including computer vision. However, ML models trained on sensitive data face security challenges, as they can be attacked and leak information. Privacy-Preserving Machine Learning (PPML) addresses this by using Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. Through analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase vulnerability in minority classes, but DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while high entropy or low Fisher Discriminant Ratio (FDR) datasets deteriorate the utility-privacy trade-off. These insights offer valuable guidance for practitioners and researchers in estimating and optimizing the utility-privacy trade-off in image datasets, helping to inform data and privacy modifications for better outcomes based on dataset characteristics.

326. 【2211.11434】Privacy in Practice: Private COVID-19 Detection in X-Ray Images (Extended Version)

链接https://arxiv.org/abs/2211.11434

作者:Lucas Lange,Maja Schneider,Peter Christen,Erhard Rahm

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling rapid screening, Machine learning, volumes of images, fight pandemics, enabling rapid

备注: Extended version of the paper accepted at the 20th International Conference on Security and Cryptography SECRYPT 2023. This version is more detailed and includes additional content: a longer results chapter and an appendix containing a proof

点击查看摘要

Abstract:Machine learning (ML) can help fight pandemics like COVID-19 by enabling rapid screening of large volumes of images. To perform data analysis while maintaining patient privacy, we create ML models that satisfy Differential Privacy (DP). Previous works exploring private COVID-19 models are in part based on small datasets, provide weaker or unclear privacy guarantees, and do not investigate practical privacy. We suggest improvements to address these open gaps. We account for inherent class imbalances and evaluate the utility-privacy trade-off more extensively and over stricter privacy budgets. Our evaluation is supported by empirically estimating practical privacy through black-box Membership Inference Attacks (MIAs). The introduced DP should help limit leakage threats posed by MIAs, and our practical analysis is the first to test this hypothesis on the COVID-19 classification task. Our results indicate that needed privacy levels might differ based on the task-dependent practical threat from MIAs. The results further suggest that with increasing DP guarantees, empirical privacy leakage only improves marginally, and DP therefore appears to have a limited impact on practical MIA defense. Our findings identify possibilities for better utility-privacy trade-offs, and we believe that empirical attack-specific privacy estimation can play a vital role in tuning for practical privacy.

327. 【2602.02167】Real-Time 2D LiDAR Object Detection Using Three-Frame RGB Scan Encoding

链接https://arxiv.org/abs/2602.02167

作者:Soheil Behnam Roudsari,Alexandre S. Brandão,Felipe N. Martins

类目:ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:Indoor service robots, RGB video, service robots, robots need perception, RGB

备注: 6 pages, 6 figures, submitted to IEEE SAS 2026

点击查看摘要

Abstract:Indoor service robots need perception that is robust, more privacy-friendly than RGB video, and feasible on embedded hardware. We present a camera-free 2D LiDAR object detection pipeline that encodes short-term temporal context by stacking three consecutive scans as RGB channels, yielding a compact YOLOv8n input without occupancy-grid construction while preserving angular structure and motion cues. Evaluated in Webots across 160 randomized indoor scenarios with strict scenario-level holdout, the method achieves 98.4% mAP@0.5 (0.778 mAP@0.5:0.95) with 94.9% precision and 94.7% recall on four object classes. On a Raspberry Pi 5, it runs in real time with a mean post-warm-up end-to-end latency of 47.8ms per frame, including scan encoding and postprocessing. Relative to a closely related occupancy-grid LiDAR-YOLO pipeline reported on the same platform, the proposed representation is associated with substantially lower reported end-to-end latency. Although results are simulation-based, they suggest that lightweight temporal encoding can enable accurate and real-time LiDAR-only detection for embedded indoor robotics without capturing RGB appearance.

328. 【2602.01681】Hyperspectral Image Fusion with Spectral-Band and Fusion-Scale Agnosticism

链接https://arxiv.org/abs/2602.01681

作者:Yu-Jie Liang,Zihan Cao,Liang-Jian Deng,Yang Yang,Malu Zhang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Hyperspectral Image Fusion, Current deep learning, Multispectral and Hyperspectral, Hyperspectral Image, Current deep

备注

点击查看摘要

Abstract:Current deep learning models for Multispectral and Hyperspectral Image Fusion (MS/HS fusion) are typically designed for fixed spectral bands and spatial scales, which limits their transferability across diverse sensors. To address this, we propose SSA, a universal framework for MS/HS fusion with spectral-band and fusion-scale agnosticism. Specifically, we introduce Matryoshka Kernel (MK), a novel operator that enables a single model to adapt to arbitrary numbers of spectral channels. Meanwhile, we build SSA upon an Implicit Neural Representation (INR) backbone that models the HS signal as a continuous function, enabling reconstruction at arbitrary spatial resolutions. Together, these two forms of agnosticism enable a single MS/HS fusion model that generalizes effectively to unseen sensors and spatial scales. Extensive experiments demonstrate that our single model achieves state-of-the-art performance while generalizing well to unseen sensors and scales, paving the way toward future HS foundation models.

329. 【2602.01577】Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

链接https://arxiv.org/abs/2602.01577

作者:Wenxuan Pan,Yang Yang,Dong Wei,Zhiyu Zhu,Jintao Wang,Huan Wu,Yao Nie

类目:ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

关键词:Camera-based visible light, low-cost indoor camera, Camera-based visible, Lamé curve-shaped LEDs, promising technique

备注: Submitted to an IEEE journal for possible publication

点击查看摘要

Abstract:Camera-based visible light positioning (VLP) is a promising technique for accurate and low-cost indoor camera pose estimation (CPE). To reduce the number of required light-emitting diodes (LEDs), advanced methods commonly exploit LED shape features for positioning. Although interesting, they are typically restricted to a single LED geometry, leading to failure in heterogeneous LED-shape scenarios. To address this challenge, this paper investigates Lamé curves as a unified representation of common LED shapes and proposes a generic VLP algorithm using Lamé curve-shaped LEDs, termed LC-VLP. In the considered system, multiple ceiling-mounted Lamé curve-shaped LEDs periodically broadcast their curve parameters via visible light communication, which are captured by a camera-equipped receiver. Based on the received LED images and curve parameters, the receiver can estimate the camera pose using LC-VLP. Specifically, an LED database is constructed offline to store the curve parameters, while online positioning is formulated as a nonlinear least-squares problem and solved iteratively. To provide a reliable initialization, a correspondence-free perspective-\textit{n}-points (FreeP\textit{n}P) algorithm is further developed, enabling approximate CPE without any pre-calibrated reference points. The performance of LC-VLP is verified by both simulations and experiments. Simulations show that LC-VLP outperforms state-of-the-art methods in both circular- and rectangular-LED scenarios, achieving reductions of over 40% in position error and 25% in rotation error. Experiments further show that LC-VLP can achieve an average position accuracy of less than 4 cm.

330. 【2602.01513】MarkCleaner: High-Fidelity Watermark Removal via Imperceptible Micro-Geometric Perturbation

链接https://arxiv.org/abs/2602.01513

作者:Xiaoxi Kong,Jieyu Yuan,Pengdi Chen,Yuanlin Zhang,Chongyi Li,Bin Li

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:conventional image-space attacks, exhibit strong robustness, watermarks exhibit strong, image-space attacks, exhibit strong

备注

点击查看摘要

Abstract:Semantic watermarks exhibit strong robustness against conventional image-space attacks. In this work, we show that such robustness does not survive under micro-geometric perturbations: spatial displacements can remove watermarks by breaking the phase alignment. Motivated by this observation, we introduce MarkCleaner, a watermark removal framework that avoids semantic drift caused by regeneration-based watermark removal. Specifically, MarkCleaner is trained with micro-geometry-perturbed supervision, which encourages the model to separate semantic content from strict spatial alignment and enables robust reconstruction under subtle geometric displacements. The framework adopts a mask-guided encoder that learns explicit spatial representations and a 2D Gaussian Splatting-based decoder that explicitly parameterizes geometric perturbations while preserving semantic content. Extensive experiments demonstrate that MarkCleaner achieves superior performance in both watermark removal effectiveness and visual fidelity, while enabling efficient real-time inference. Our code will be made available upon acceptance.

331. 【2602.01482】Community-Level Modeling of Gyral Folding Patterns for Robust and Anatomically Informed Individualized Brain Mapping

链接https://arxiv.org/abs/2602.01482

作者:Minheng Chen,Tong Chen,Yan Zhuang,Chao Cao,Jing Zhang,Tianming Liu,Lu Zhang,Dajiang Zhu

类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibits substantial inter-individual, folding exhibits substantial, substantial inter-individual variability, preserving stable anatomical, enable fine-scale characterization

备注

点击查看摘要

Abstract:Cortical folding exhibits substantial inter-individual variability while preserving stable anatomical landmarks that enable fine-scale characterization of cortical organization. Among these, the three-hinge gyrus (3HG) serves as a key folding primitive, showing consistent topology yet meaningful variations in morphology, connectivity, and function. Existing landmark-based methods typically model each 3HG independently, ignoring that 3HGs form higher-order folding communities that capture mesoscale structure. This simplification weakens anatomical representation and makes one-to-one matching sensitive to positional variability and noise. We propose a spectral graph representation learning framework that models community-level folding units rather than isolated landmarks. Each 3HG is encoded using a dual-profile representation combining surface topology and structural connectivity. Subject-specific spectral clustering identifies coherent folding communities, followed by topological refinement to preserve anatomical continuity. For cross-subject correspondence, we introduce Joint Morphological-Geometric Matching, jointly optimizing geometric and morphometric similarity. Across over 1000 Human Connectome Project subjects, the resulting communities show reduced morphometric variance, stronger modular organization, improved hemispheric consistency, and superior alignment compared with atlas-based and landmark-based or embedding-based baselines. These findings demonstrate that community-level modeling provides a robust and anatomically grounded framework for individualized cortical characterization and reliable cross-subject correspondence.

332. 【2602.01444】A texture-based framework for foundational ultrasound models

链接https://arxiv.org/abs/2602.01444

作者:Tal Grutman,Carmel Shinar,Tali Ilovitsh

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:medical imaging modality, imaging modality, arising from tissue-dependent, tissue-dependent scattering, natural-image statistics

备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Ultrasound is the most widely used medical imaging modality, yet the images it produces are fundamentally unique, arising from tissue-dependent scattering, reflection, and speed-of-sound variations that produce a constrained set of characteristic textures that differ markedly from natural-image statistics. These acoustically driven patterns make ultrasound challenging for algorithms originally designed for natural images. To bridge this gap, the field has increasingly turned to foundation models, hoping to leverage their generalization capabilities. However, these models often falter in ultrasound applications because they are not designed for ultrasound physics, they are merely trained on ultrasound data. Therefore, it is essential to integrate ultrasound-specific domain knowledge into established learning frameworks. We achieve this by reformulating self-supervised learning as a texture-analysis problem, introducing texture ultrasound semantic analysis (TUSA). Using TUSA, models learn to leverage highly scalable contrastive methods to extract true domain-specific representations directly from simple B-mode images. We train a TUSA model on a combination of open-source, simulated, and in vivo data. The latent space is compared to several larger foundation models, demonstrating that our approach gives TUSA models better generalizability for difficult downstream tasks on unique online datasets as well as a clinical eye dataset collected for this study. Our model achieves higher accuracy in detecting COVID (70%), spinal hematoma (100%) and vitreous hemorrhage (97%) and correlates more closely with quantitative parameters like liver steatosis (r = 0.83), ejection fraction (r = 0.63), and oxygen saturation (r = 0.38). We open-source the model weights and training script: this https URL

333. 【2602.00483】Recent Advances of End-to-End Video Coding Technologies for AVS Standard Development

链接https://arxiv.org/abs/2602.00483

作者:Xihua Sheng,Xiongzhuang Liang,Chuanbo Tang,Zhirui Zuo,Yifan Bian,Yutao Xie,Zhuoyuan Li,Yuqi Li,Hui Xiang,Li Li,Dong Liu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:intelligent video coding, Video coding, video compression technologies, AVS video coding, efficient video compression

备注

点击查看摘要

Abstract:Video coding standards are essential to enable the interoperability and widespread adoption of efficient video compression technologies. In pursuit of greater video compression efficiency, the AVS video coding working group launched the standardization exploration of end-to-end intelligent video coding, establishing the AVS End-to-End Intelligent Video Coding Exploration Model (AVS-EEM) project. A core design principle of AVS-EEM is its focus on practical deployment, featuring inherently low computational complexity and requiring strict adherence to the common test conditions of conventional video coding. This paper details the development history of AVS-EEM and provides a systematic introduction to its key technical framework, covering model architectures, training strategies, and inference optimizations. These innovations have collectively driven the project's rapid performance evolution, enabling continuous and significant gains under strict complexity constraints. Through over two years of iterative refinement and collaborative effort, the coding performance of AVS-EEM has seen substantial improvement. Experimental results demonstrate that its latest model achieves superior compression efficiency compared to the conventional AVS3 reference software, marking a significant step toward a deployable intelligent video coding standard.

334. 【2602.00464】A 30-item Test for Assessing Chinese Character Amnesia in Child Handwriters

链接https://arxiv.org/abs/2602.00464

作者:Zebo Xu,Steven Langsford,Zhuang Qiu,Zhenguang Cai

类目:Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)

关键词:Chinese character amnesia, character amnesia, important skill, Chinese character, Handwriting

备注

点击查看摘要

Abstract:Handwriting literacy is an important skill for learning and communication in school-age children. In the digital age, handwriting has been largely replaced by typing, leading to a decline in handwriting proficiency, particularly in non-alphabetic writing systems. Among children learning Chinese, a growing number have reported experiencing character amnesia: difficulty in correctly handwriting a character despite being able to recognize it. Given that there is currently no standardized diagnostic tool for assessing character amnesia in children, we developed an assessment to measure Chinese character amnesia in Mandarin-speaking school-age population. We utilised a large-scale handwriting dataset in which 40 children handwrote 800 characters from dictation prompts. Character amnesia and correct handwriting responses were analysed using a two-parameter Item Response Theory model. Four item-selection schemes were compared: random baseline, maximum discrimination, diverse difficulty, and an upper-and-lower-thirds discrimination score. Candidate item subsets were evaluated using out-of-sample prediction. Among these selection schemes, the upper-and-lower-thirds discrimination procedure yields a compact 30-item test that preserves individual-difference structure and generalizes to unseen test-takers (cross-validated mean r =.74 with full 800-item-test; within-sample r =.93). This short-form test provides a reliable and efficient tool of assessing Chinese character amnesia in children and can be used to identify early handwriting and orthographic learning difficulties, contributing to the early detection of developmental dysgraphia and related literacy challenges.

335. 【2602.00324】Dual Quaternion SE(3) Synchronization with Recovery Guarantees

链接https://arxiv.org/abs/2602.00324

作者:Jianing Zhao,Linglingzhi Zhu,Anthony Man-Cho So

类目:Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Signal Processing (eess.SP)

关键词:special Euclidean group, noisy pairwise relative, pairwise relative transformations, recover absolute poses, special Euclidean

备注

点击查看摘要

Abstract:Synchronization over the special Euclidean group SE(3) aims to recover absolute poses from noisy pairwise relative transformations and is a core primitive in robotics and 3D vision. Standard approaches often require multi-step heuristic procedures to recover valid poses, which are difficult to analyze and typically lack theoretical guarantees. This paper adopts a dual quaternion representation and formulates SE(3) synchronization directly over the unit dual quaternion. A two-stage algorithm is developed: A spectral initializer computed via the power method on a Hermitian dual quaternion measurement matrix, followed by a dual quaternion generalized power method (DQGPM) that enforces feasibility through per-iteration projection. The estimation error bounds are established for spectral estimators, and DQGPM is shown to admit a finite-iteration error bound and achieves linear error contraction up to an explicit noise-dependent threshold. Experiments on synthetic benchmarks and real-world multi-scan point-set registration demonstrate that the proposed pipeline improves both accuracy and efficiency over representative matrix-based methods.

336. 【2602.00221】Benchmarking Vanilla GAN, DCGAN, and WGAN Architectures for MRI Reconstruction: A Quantitative Analysis

链接https://arxiv.org/abs/2602.00221

作者:Humaira Mehwish,Hina Shakir,Muneeba Rashid,Asarim Aamir,Reema Qaiser Khan

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Magnetic Resonance Imaging, Magnetic Resonance, crucial imaging modality, Resonance Imaging, Vanilla GAN

备注: 20 pages

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is a crucial imaging modality for viewing internal body structures. This research work analyses the performance of popular GAN models for accurate and precise MRI reconstruction by enhancing image quality and improving diagnostic accuracy. Three GAN architectures considered in this study are Vanilla GAN, Deep Convolutional GAN (DCGAN), and Wasserstein GAN (WGAN). They were trained and evaluated using knee, brain, and cardiac MRI datasets to assess their generalizability across body regions. While the Vanilla GAN operates on the fundamentals of the adversarial network setup, DCGAN advances image synthesis by securing the convolutional layers, giving a superior appearance to the prevalent spatial features. Training instability is resolved in WGAN through the Wasserstein distance to minimize an unstable regime, therefore, ensuring stable convergence and high-quality images. The GAN models were trained and tested using 1000 MR images of an anonymized knee, 805 images of Heart, 90 images of Brain MRI dataset. The Structural Similarity Index (SSIM) for Vanilla GAN is 0.84, DCGAN is 0.97, and WGAN is 0.99. The Peak Signal to Noise Ratio (PSNR) for Vanilla GAN is 26, DCGAN is 49.3, and WGAN is 43.5. The results were further statistically validated. This study shows that DCGAN and WGAN-based frameworks are promising in MR image reconstruction because of good image quality and superior accuracy. With the first cross-organ benchmark of baseline GANs under a common preprocessing pipeline, this work provides a reproducible benchmark for future hybrid GANs and clinical MRI applications.

337. 【2602.00220】Advanced Geometric Correction Algorithms for 3D Medical Reconstruction: Comparison of Computed Tomography and Macroscopic Imaging

链接https://arxiv.org/abs/2602.00220

作者:Tomasz Les,Tomasz Markiewicz,Malgorzata Lorent,Miroslaw Dziekiewicz,Krzysztof Siwek

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:geometric reference standard, hybrid two-stage registration, two-stage registration framework, reconstructing three-dimensional, reference standard

备注: 24 pages, 9 figures, submitted to Applied Sciences (MDPI)

点击查看摘要

Abstract:This paper introduces a hybrid two-stage registration framework for reconstructing three-dimensional (3D) kidney anatomy from macroscopic slices, using CT-derived models as the geometric reference standard. The approach addresses the data-scarcity and high-distortion challenges typical of macroscopic imaging, where fully learning-based registration (e.g., VoxelMorph) often fails to generalize due to limited training diversity and large nonrigid deformations that exceed the capture range of unconstrained convolutional filters. In the proposed pipeline, the Optimal Cross-section Matching (OCM) algorithm first performs constrained global alignment: translation, rotation, and uniform scaling to establish anatomically consistent slice initialization. Next, a lightweight deep-learning refinement network, inspired by VoxelMorph, predicts residual local deformations between consecutive slices. The core novelty of this architecture lies in its hierarchical decomposition of the registration manifold. This hybrid OCM+DL design integrates explicit geometric priors with the flexible learning capacity of neural networks, ensuring stable optimization and plausible deformation fields even with few training examples. Experiments on an original dataset of 40 kidneys demonstrated better results compared to single-stage baselines. The pipeline maintains physical calibration via Hough-based grid detection and employs Bezier-based contour smoothing for robust meshing and volume estimation. Although validated on kidney data, the proposed framework generalizes to other soft-tissue organs reconstructed from optical or photographic cross-sections. By decoupling interpretable global optimization from data-efficient deep refinement, the method advances the precision, reproducibility, and anatomical realism of multimodal 3D reconstructions for surgical planning, morphological assessment, and medical education.

338. 【2602.00215】A Renderer-Enabled Framework for Computing Parameter Estimation Lower Bounds in Plenoptic Imaging Systems

链接https://arxiv.org/abs/2602.00215

作者:Abhinav V. Sambasivan,Liam J. Coulter,Richard G. Paxman,Jarvis D. Haupt

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:scene parameter estimation, plenoptic imaging systems, lower bounds, computed lower bounds, parameter estimation error

备注

点击查看摘要

Abstract:This work focuses on assessing the information-theoretic limits of scene parameter estimation in plenoptic imaging systems. A general framework to compute lower bounds on the parameter estimation error from noisy plenoptic observations is presented, with a particular focus on passive indirect imaging problems, where the observations do not contain line-of-sight information about the parameter(s) of interest. Using computer graphics rendering software to synthesize the often-complicated dependence among parameter(s) of interest and observations, i.e. the forward model, the proposed framework evaluates the Hammersley-Chapman-Robbins bound to establish lower bounds on the variance of any unbiased estimator of the unknown parameters. The effects of inexact rendering of the true forward model on the computed lower bounds are also analyzed, both theoretically and via simulations. Experimental evaluations compare the computed lower bounds with the performance of the Maximum Likelihood Estimator on a canonical object localization problem, showing that the lower bounds computed via the framework proposed here are indicative of the true underlying fundamental limits in several nominally representative scenarios.

339. 【2602.00186】SurfelSoup: Learned Point Cloud Geometry Compression With a Probablistic SurfelTree Representation

链接https://arxiv.org/abs/2602.00186

作者:Tingyu Fan,Ran Gong,Yueyu Hu,Yao Wang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:paper presents SurfelSoup, learned surface-based framework, presents SurfelSoup, learned surface-based, paper presents

备注

点击查看摘要

Abstract:This paper presents SurfelSoup, an end-to-end learned surface-based framework for point cloud geometry compression, with surface-structured primitives for representation. It proposes a probabilistic surface representation, pSurfel, which models local point occupancies using a bounded generalized Gaussian distribution. In addition, the pSurfels are organized into an octree-like hierarchy, pSurfelTree, with a Tree Decision module that adaptively terminates the tree subdivision for rate-distortion optimal Surfel granularity selection. This formulation avoids redundant point-wise compression in smooth regions and produces compact yet smooth surface reconstructions. Experimental results under the MPEG common test condition show consistent gain on geometry compression over voxel-based baselines and MPEG standard G-PCC-GesTM-TriSoup, while providing visually superior reconstructions with smooth and coherent surface structures.

340. 【2602.00184】Visible Singularities Guided Correlation Network for Limited-Angle CT Reconstruction

链接https://arxiv.org/abs/2602.00184

作者:Yiyang Wen,Liu Shi,Zekun Zhou,WenZhe Shan,Qiegen Liu

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Limited-angle computed tomography, shortened scanning time, reduced radiation dose, Limited-angle computed, LACT reconstruction

备注

点击查看摘要

Abstract:Limited-angle computed tomography (LACT) offers the advantages of reduced radiation dose and shortened scanning time. Traditional reconstruction algorithms exhibit various inherent limitations in LACT. Currently, most deep learning-based LACT reconstruction methods focus on multi-domain fusion or the introduction of generic priors, failing to fully align with the core imaging characteristics of LACT-such as the directionality of artifacts and directional loss of structural information, which are caused by the absence of projection angles in certain directions. Inspired by the theory of visible and invisible singularities, taking into account the aforementioned core imaging characteristics of LACT, we propose a Visible Singularities Guided Correlation network for LACT reconstruction (VSGC). The design philosophy of VSGC consists of two core steps: First, extract VS edge features from LACT images and focus the model's attention on these VS. Second, establish correlations between the VS edge features and other regions of the image. Additionally, a multi-scale loss function with anisotropic constraint is employed to constrain the model to converge in multiple aspects. Finally, qualitative and quantitative validations are conducted on both simulated and real datasets to verify the effectiveness and feasibility of the proposed design. Particularly, in comparison with alternative methods, VSGC delivers more prominent performance in small angular ranges, with the PSNR improvement of 2.45 dB and the SSIM enhancement of 1.5\%. The code is publicly available at this https URL.

341. 【2602.00141】Comparison of Image Processing Models in Quark Gluon Jet Classification

链接https://arxiv.org/abs/2602.00141

作者:Daeun Kim,Jiwon Lee,Wonjun Jeong,Hyeongwoo Noh,Giyeong Kim,Jaeyoon Cho,Geonhee Kwak,Seunghwan Yang,MinJung Kweon

类目:Data Analysis, Statistics and Probability (physics.data-an); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)

关键词:simulated jet images, images from Pythia, present a comprehensive, comprehensive comparison, distinguishing quark

备注: 17 pages, 10 Figures

点击查看摘要

Abstract:We present a comprehensive comparison of convolutional and transformer-based models for distinguishing quark and gluon jets using simulated jet images from Pythia 8. By encoding jet substructure into a three-channel representation of particle kinematics, we evaluate the performance of convolutional neural networks (CNNs), Vision Transformers (ViTs), and Swin Transformers (Swin-Tiny) under both supervised and self-supervised learning setups. Our results show that fine-tuning only the final two transformer blocks of the Swin-Tiny model achieves the best trade-off between efficiency and accuracy, reaching 81.4% accuracy and an AUC (area under the ROC curve) of 88.9%. Self-supervised pretraining with Momentum Contrast (MoCo) further enhances feature robustness and reduces the number of trainable parameters. These findings highlight the potential of hierarchical attention-based models for jet substructure studies and for domain transfer to real collision data.

342. 【2602.00136】oward a Unified Semantic Loss Model for Deep JSCC-based Transmission of EO Imagery

链接https://arxiv.org/abs/2602.00136

作者:Ti Ti Nguyen,Thanh-Dung Le,Vu Nguyen Ha,Duc-Dung Tran,Hung Nguyen-Kha,Dinh-Hieu Tran,Carlos L. Marcos-Rojas,Juan C. Merlano-Duncan,Symeon Chatzinotas

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Modern Earth Observation, Modern Earth, Earth Observation, support critical applications, systems increasingly rely

备注: 5 pages, 5 figures

点击查看摘要

Abstract:Modern Earth Observation (EO) systems increasingly rely on high-resolution imagery to support critical applications such as environmental monitoring, disaster response, and land-use analysis. Although these applications benefit from detailed visual data, the resulting data volumes impose significant challenges on satellite communication systems constrained by limited bandwidth, power, and dynamic link conditions. To address these limitations, this paper investigates Deep Joint Source-Channel Coding (DJSCC) as an effective source-channel paradigm for the transmission of EO imagery. We focus on two complementary aspects of semantic loss in DJSCC-based systems. First, a reconstruction-centric framework is evaluated by analyzing the semantic degradation of reconstructed images under varying compression ratios and channel signal-to-noise ratios (SNR). Second, a task-oriented framework is developed by integrating DJSCC with lightweight, application-specific models (e.g., EfficientViT), with performance measured using downstream task accuracy rather than pixel-level fidelity. Based on extensive empirical analysis, we propose a unified semantic loss framework that captures both reconstruction-centric and task-oriented performance within a single model. This framework characterizes the implicit relationship between JSCC compression, channel SNR, and semantic quality, offering actionable insights for the design of robust and efficient EO imagery transmission under resource-constrained satellite links.

343. 【2602.00100】Frequent Pattern Mining approach to Image Compression

链接https://arxiv.org/abs/2602.00100

作者:Avinash Kadimisetty,C. Oswald,B. Sivalselvan

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Frequent Pattern Mining, Frequent Sequence Mining, explaining efficient approaches, Sequential Pattern Mining, efficient approaches based

备注

点击查看摘要

Abstract:The paper focuses on Image Compression, explaining efficient approaches based on Frequent Pattern Mining(FPM). The proposed compression mechanism is based on clustering similar pixels in the image and thus using cluster identifiers in image compression. Redundant data in the image is effectively handled by replacing the DCT phase of conventional JPEG through a mixture of k-means Clustering and Closed Frequent Sequence Mining. To optimize the cardinality of pattern(s) in encoding, efficient pruning techniques have been used through the refinement of Conventional Generalized Sequential Pattern Mining(GSP) algorithm. We have proposed a mechanism for finding the frequency of a sequence which will yield significant reduction in the code table size. The algorithm is tested by compressing benchmark datasets yielding an improvement of 45% in compression ratios, often outperforming the existing alternatives. PSNR and SSIM, which are the image quality metrics, have been tested which show a negligible loss in visual quality.