本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新853篇论文,其中:

  • 自然语言处理130
  • 信息检索23
  • 计算机视觉152

自然语言处理

1. 【2602.03845】Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

链接https://arxiv.org/abs/2602.03845

作者:Tong Zheng,Chengsong Huang,Runpeng Dai,Yun He,Rui Liu,Xin Ni,Huiwen Bao,Kaishen Wang,Hongtu Zhu,Jiaxin Huang,Furong Huang,Heng Huang

类目:Computation and Language (cs.CL)

关键词:significant computational burdens, imposes significant computational, computational burdens, promising paradigm, imposes significant

备注: 14 pages

点击查看摘要

Abstract:Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce $\textbf{Parallel-Probe}$, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to $\textbf{35.8}$% and total token cost by over $\textbf{25.8}$% while maintaining competitive accuracy.

2. 【2602.03837】Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

链接https://arxiv.org/abs/2602.03837

作者:David P. Woodruff,Vincent Cohen-Addad,Lalit Jain,Jieming Mao,Song Zuo,MohammadHossein Bateni,Simina Branzei,Michael P. Brenner,Lin Chen,Ying Feng,Lance Fortnow,Gang Fu,Ziyi Guan,Zahra Hadizadeh,Mohammad T. Hajiaghayi,Mahdi JafariRaviz,Adel Javanmard,Karthik C. S.,Ken-ichi Kawarabayashi,Ravi Kumar,Silvio Lattanzi,Euiwoong Lee,Yi Li,Ioannis Panageas,Dimitris Paparas,Benjamin Przybocki,Bernardo Subercaseaux,Ola Svensson,Shayan Taherijam,Xuan Wu,Eylon Yogev,Morteza Zadimoghaddam,Samson Zhou,Vahab Mirrokni

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Recent advances, large language models, advances in large, large language, opened new avenues

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.

3. 【2602.03828】AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

链接https://arxiv.org/abs/2602.03828

作者:Minjun Zhu,Zhen Lin,Yixuan Weng,Panzhong Lu,Qiujie Xie,Yifan Wei,Sifan Liu,Qiyao Sun,Yue Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

关键词:effectively communicating complex, manual creation remains, communicating complex scientific, High-quality scientific illustrations, scientific

备注: Accepted at the ICLR 2026

点击查看摘要

Abstract:High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset and huggingface space are released in this https URL.

4. 【2602.03822】hey Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References

链接https://arxiv.org/abs/2602.03822

作者:Sahil Tripathi,Gautam Siddharth Kashyap,Mehwish Nasim,Jian Yang,Jiechao Gao,Usman Naseem

类目:Computation and Language (cs.CL)

关键词:Meme-based social abuse, subtle cross-modal incongruence, social abuse detection, Meme-based social, implicit cultural symbolism

备注: Accepted at the The Web Conference 2026 (Research Track)

点击查看摘要

Abstract:Meme-based social abuse detection is challenging because harmful intent often relies on implicit cultural symbolism and subtle cross-modal incongruence. Prior approaches, from fusion-based methods to in-context learning with Large Vision-Language Models (LVLMs), have made progress but remain limited by three factors: i) cultural blindness (missing symbolic context), ii) boundary ambiguity (satire vs. abuse confusion), and iii) lack of interpretability (opaque model reasoning). We introduce CROSS-ALIGN+, a three-stage framework that systematically addresses these limitations: (1) Stage I mitigates cultural blindness by enriching multimodal representations with structured knowledge from ConceptNet, Wikidata, and Hatebase; (2) Stage II reduces boundary ambiguity through parameter-efficient LoRA adapters that sharpen decision boundaries; and (3) Stage III enhances interpretability by generating cascaded explanations. Extensive experiments on five benchmarks and eight LVLMs demonstrate that CROSS-ALIGN+ consistently outperforms state-of-the-art methods, achieving up to 17% relative F1 improvement while providing interpretable justifications for each decision.

5. 【2602.03812】Antidistillation Fingerprinting

链接https://arxiv.org/abs/2602.03812

作者:Yixuan Even Xu,John Kirchenbauer,Yash Savani,Asher Trockman,Alexander Robey,Tom Goldstein,Fei Fang,J. Zico Kolter

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:enables efficient emulation, frontier large language, distillation enables efficient, teacher model outputs, large language models

备注: 26 pages, 11 figures

点击查看摘要

Abstract:Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.

6. 【2602.03806】Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

链接https://arxiv.org/abs/2602.03806

作者:Ziru Chen,Dongdong Chen,Ruinan Jin,Yingbin Liang,Yujia Xie,Huan Sun

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:large language models, significant research interests, training large language, multi-turn code generation, code generation

备注

点击查看摘要

Abstract:Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at this https URL.

7. 【2602.03798】FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation

链接https://arxiv.org/abs/2602.03798

作者:Zimu Lu,Houxing Ren,Yunqiao Yang,Ke Wang,Zhuofan Zong,Mingjie Zhan,Hongsheng Li

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Assisting non-expert users, develop complex interactive, Assisting non-expert, complex interactive websites, frontend web pages

备注

点击查看摘要

Abstract:Assisting non-expert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach. The code is released at this https URL.

8. 【2602.03792】WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

链接https://arxiv.org/abs/2602.03792

作者:Xilong Wang,Yinuo Liu,Zhun Wang,Dawn Song,Neil Gong

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:execute attacker-specified tasks, Prompt injection attacks, injection attacks manipulate, Prompt injection, web agents

备注

点击查看摘要

Abstract:Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web-agent setting. In this work, we propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts \emph{segments of interest} that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: this https URL.

9. 【2602.03786】AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

链接https://arxiv.org/abs/2602.03786

作者:Jianhao Ruan,Zhihao Xu,Yiran Peng,Fashen Ren,Zhaoyang Yu,Xinbing Liang,Jinyu Xiang,Bang Liu,Chenglin Wu,Yuyu Luo,Jiayi Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:shown strong promise, Language agents, shown strong, strong promise, Language

备注

点击查看摘要

Abstract:Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework-agnostic with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto-efficient. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: this https URL

10. 【2602.03784】Context Compression via Explicit Information Transmission

链接https://arxiv.org/abs/2602.03784

作者:Jiangnan Ye,Hanqi Yan,Zhenyi Shen,Heng Chang,Ye Mao,Yulan He

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, growing key-value caches, inference with Large, Language Models

备注

点击查看摘要

Abstract:Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.

11. 【2602.03783】Efficient Estimation of Kernel Surrogate Models for Task Attribution

链接https://arxiv.org/abs/2602.03783

作者:Zhenshuo Zhang,Minxuan Duan,Hongyang R. Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Modern AI agents, surrogate models, large language models, code generation, kernel surrogate models

备注: 27 pages. To appear in ICLR 2026

点击查看摘要

Abstract:Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task's performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate estimates with less than $2\%$ relative error without repeated retraining. Experiments across multiple domains -- including math reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models. They achieve a $25\%$ higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines. When used for downstream task selection, kernel surrogate models yield a $40\%$ improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.

12. 【2602.03731】CUBO: Self-Contained Retrieval-Augmented Generation on Consumer Laptops 10 GB Corpora, 16 GB RAM, Single-Device Deployment

链接https://arxiv.org/abs/2602.03731

作者:Paolo Astrino

类目:Computation and Language (cs.CL)

关键词:Organizations handling sensitive, handling sensitive documents, sensitive documents face, local systems typically, systems typically require

备注: 24 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Organizations handling sensitive documents face a tension: cloud-based AI risks GDPR violations, while local systems typically require 18-32 GB RAM. This paper presents CUBO, a systems-oriented RAG platform for consumer laptops with 16 GB shared memory. CUBO's novelty lies in engineering integration of streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware-aware orchestration that enables competitive Recall@10 (0.48-0.97 across BEIR domains) within a hard 15.5 GB RAM ceiling. The 37,000-line codebase achieves retrieval latencies of 185 ms (p50) on C1,300 laptops while maintaining data minimization through local-only processing aligned with GDPR Art. 5(1)(c). Evaluation on BEIR benchmarks validates practical deployability for small-to-medium professional archives. The codebase is publicly available at this https URL.

13. 【2602.03719】raining Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

链接https://arxiv.org/abs/2602.03719

作者:Yubao Zhao,Weiquan Huang,Sudong Wang,Ruochen Zhao,Chen Chen,Yao Shu,Chengwei Qin

类目:Computation and Language (cs.CL)

关键词:Agentic reinforcement learning, enabled large language, large language models, perform complex multi-turn, complex multi-turn planning

备注: 24 pages, 5 figures

点击查看摘要

Abstract:Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, We identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at \href{this https URL}{code}.

14. 【2602.03709】No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

链接https://arxiv.org/abs/2602.03709

作者:Vynska Amalia Permadi,Xingwei Tan,Nafise Sadat Moosavi,Nikos Aletras

类目:Computation and Language (cs.CL)

关键词:implicit social knowledge, recalling isolated facts, Understanding culture requires, culture requires reasoning, social knowledge

备注

点击查看摘要

Abstract:Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.

15. 【2602.03708】Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

链接https://arxiv.org/abs/2602.03708

作者:Ximing Dong,Shaowei Wang,Dayi Lin,Boyuan Chen,Ahmed E. Hassan

类目:Computation and Language (cs.CL); Performance (cs.PF)

关键词:Large Language Models, Large Reasoning Models, Large Language, Language Models, high inference latency

备注

点击查看摘要

Abstract:Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific meanings. Experiments on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.

16. 【2602.03707】OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

链接https://arxiv.org/abs/2602.03707

作者:Yifan Zhu,Xinyu Mu,Tao Feng,Zhonghong Ou,Yuning Gong,Haoran Luo

类目:Computation and Language (cs.CL)

关键词:Long-horizon omnimodal question, omnimodal question answering, question answering answers, answering answers questions, Long-horizon omnimodal

备注

点击查看摘要

Abstract:Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end this http URL address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.

17. 【2602.03704】Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

链接https://arxiv.org/abs/2602.03704

作者:Yu Tian,Linh Huynh,Katerina Christhilf,Shubham Chakraborty,Micah Watanabe,Tracy Arner,Danielle McNamara

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, made automated multiple-choice, cognitive demands remains, Recent advances, satisfy controlled cognitive

备注: This manuscript is under review at Electronics

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.

18. 【2602.03696】Conflict-Resolving and Sharpness-Aware Minimization for Generalized Knowledge Editing with Multiple Updates

链接https://arxiv.org/abs/2602.03696

作者:Duy Nguyen,Hanqi Xiao,Archiki Prasad,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large language models, Large language, rely on internal, downstream tasks, making it crucial

备注: 22 pages, 8 figures. Code link: [this https URL](https://github.com/duykhuongnguyen/CoRSA)

点击查看摘要

Abstract:Large language models (LLMs) rely on internal knowledge to solve many downstream tasks, making it crucial to keep them up to date. Since full retraining is expensive, prior work has explored efficient alternatives such as model editing and parameter-efficient fine-tuning. However, these approaches often break down in practice due to poor generalization across inputs, limited stability, and knowledge conflict. To address these limitations, we propose the CoRSA (Conflict-Resolving and Sharpness-Aware Minimization) training framework, a parameter-efficient, holistic approach for knowledge editing with multiple updates. CoRSA tackles multiple challenges simultaneously: it improves generalization to different input forms and enhances stability across multiple updates by minimizing loss curvature, and resolves conflicts by maximizing the margin between new and prior knowledge. Across three widely used fact editing benchmarks, CoRSA achieves significant gains in generalization, outperforming baselines with average absolute improvements of 12.42% over LoRA and 10% over model editing methods. With multiple updates, it maintains high update efficacy while reducing catastrophic forgetting by 27.82% compared to LoRA. CoRSA also generalizes to the code domain, outperforming the strongest baseline by 5.48% Pass@5 in update efficacy.

19. 【2602.03695】Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

链接https://arxiv.org/abs/2602.03695

作者:Haibo Jin,Kuang Peng,Ye Yu,Xiaopeng Yuan,Haohan Wang

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:increased architectural complexity, manually crafted agent, crafted agent roles, handle complex problems, highly task-specific

备注: 16 pages

点击查看摘要

Abstract:While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose \textbf{Agent Primitives}, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key-value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitives-based MAS improve average accuracy by 12.0-16.5\% over single-agent baselines, reduce token usage and inference latency by approximately 3$\times$-4$\times$ compared to text-based MAS, while incurring only 1.3$\times$-1.6$\times$ overhead relative to single-agent inference and providing more stable performance across model backbones.

Comments:
16 pages

Subjects:

Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2602.03695 [cs.MA]

(or
arXiv:2602.03695v1 [cs.MA] for this version)

https://doi.org/10.48550/arXiv.2602.03695

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
20. 【2602.03693】OCRTurk: A Comprehensive OCR Benchmark for Turkish

链接https://arxiv.org/abs/2602.03693

作者:Deniz Yılmaz,Evren Ayberk Munis,Çağrı Toraman,Süha Kağan Köse,Burak Aktaş,Mehmet Can Baytekin,Bilge Kaan Görür

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large-scale document digitization, retrieval-augmented generation, Turkish document parsing, healthcare and education, domain-specific pipelines

备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type. Models perform well on non-academic documents, while slideshows become the most challenging.

21. 【2602.03689】Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

链接https://arxiv.org/abs/2602.03689

作者:Jiashuo Sun,Pengcheng Jiang,Saizhuo Wang,Jiajun Fan,Heng Wang,Siru Ouyang,Ming Zhong,Yizhu Jiao,Chengsong Huang,Xueqiang Xu,Pengrui Han,Peiran Li,Jiaxin Huang,Ge Liu,Heng Ji,Jiawei Han

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:systems remain brittle, Retrieval-Augmented Generation, realistic retrieval noise, generator Goldilocks Zone, systems remain

备注: 19 pages, 8 tables, 5 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator's Goldilocks Zone -- evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly avaliable at this https URL.

22. 【2602.03681】Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

链接https://arxiv.org/abs/2602.03681

作者:Difan Deng,Andreas Bentzen Winje,Lukas Fehring,Marius Lindauer

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:linear attention, attention, long-context scenarios, softmax attention, linear attention model

备注: 17 pages, 8 figures

点击查看摘要

Abstract:The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.

23. 【2602.03677】Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

链接https://arxiv.org/abs/2602.03677

作者:Yu Zhang,Mufan Xu,Xuefeng Bai,Kehai chen,Pengfei Zhang,Yang Xiang,Min Zhang

类目:Computation and Language (cs.CL)

关键词:selectively utilize multimodal, multimodal large language, utilize multimodal contexts, multimodal contexts based, large language models

备注: Modality Following

点击查看摘要

Abstract:Modality following serves as the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-making process remain poorly understood. In this paper, we investigate its working mechanism through an information flow lens. Our findings reveal that instruction tokens function as structural anchors for modality arbitration: Shallow attention layers perform non-selective information transfer, routing multimodal cues to these anchors as a latent buffer; Modality competition is resolved within deep attention layers guided by the instruction intent, while MLP layers exhibit semantic inertia, acting as an adversarial force. Furthermore, we identify a sparse set of specialized attention heads that drive this arbitration. Causal interventions demonstrate that manipulating a mere $5\%$ of these critical heads can decrease the modality-following ratio by $60\%$ through blocking, or increase it by $60\%$ through targeted amplification of failed samples. Our work provides a substantial step toward model transparency and offers a principled framework for the orchestration of multimodal information in MLLMs.

24. 【2602.03652】RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

链接https://arxiv.org/abs/2602.03652

作者:Süha Kağan Köse,Mehmet Can Baytekin,Burak Aktaş,Bilge Kaan Görür,Evren Ayberk Munis,Deniz Yılmaz,Muhammed Yusuf Kartal,Çağrı Toraman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:enhances LLM factuality, guidance remains English-centric, design guidance remains, morphologically rich languages, Retrieval-Augmented Generation

备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%) that is considerably higher than the baseline (78.70%). Also a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) with much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.

25. 【2602.03647】Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

链接https://arxiv.org/abs/2602.03647

作者:Bowei He,Minda Hu,Zenan Xu,Hongru Wang,Licheng Zong,Yankai Chen,Chen Ma,Xue Liu,Pluto Zhou,Irwin King

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:querying external sources, Search-integrated reasoning enables, transcend static parametric, static parametric knowledge, actively querying external

备注

点击查看摘要

Abstract:Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a 'cut-and-regenerate' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.

26. 【2602.03640】utorial on Reasoning for IR IR for Reasoning

链接https://arxiv.org/abs/2602.03640

作者:Mohanna Hoveyda,Panagiotis Efstratiadis,Arjen de Vries,Maarten de Rijke

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:semantic relatedness, long focused, focused on ranking, ranking documents, documents by semantic

备注: Accepted to ECIR 2026

点击查看摘要

Abstract:Information retrieval has long focused on ranking documents by semantic relatedness. Yet many real-world information needs demand more: enforcement of logical constraints, multi-step inference, and synthesis of multiple pieces of evidence. Addressing these requirements is, at its core, a problem of reasoning. Across AI communities, researchers are developing diverse solutions for the problem of reasoning, from inference-time strategies and post-training of LLMs, to neuro-symbolic systems, Bayesian and probabilistic frameworks, geometric representations, and energy-based models. These efforts target the same problem: to move beyond pattern-matching systems toward structured, verifiable inference. However, they remain scattered across disciplines, making it difficult for IR researchers to identify the most relevant ideas and opportunities. To help navigate the fragmented landscape of research in reasoning, this tutorial first articulates a working definition of reasoning within the context of information retrieval and derives from it a unified analytical framework. The framework maps existing approaches along axes that reflect the core components of the definition. By providing a comprehensive overview of recent approaches and mapping current methods onto the defined axes, we expose their trade-offs and complementarities, highlight where IR can benefit from cross-disciplinary advances, and illustrate how retrieval process itself can play a central role in broader reasoning systems. The tutorial will equip participants with both a conceptual framework and practical guidance for enhancing reasoning-capable IR systems, while situating IR as a domain that both benefits and contributes to the broader development of reasoning methodologies.

27. 【2602.03635】RE: Encouraging Exploration in the Trust Region

链接https://arxiv.org/abs/2602.03635

作者:Chao Huang,Yujing Lu,Quangang Li,Shenghe Wang,Yan Wang,Yueyang Zhang,Long Xia,Jiashu Zhao,Zhiyuan Sun,Daiting Shi,Tingwen Liu

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, yields negligible effects, performance in Large, Language Models

备注

点击查看摘要

Abstract:Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at this https URL.

28. 【2602.03633】BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish

链接https://arxiv.org/abs/2602.03633

作者:Burak Aktaş,Mehmet Can Baytekin,Süha Kağan Köse,Ömer İlbilgi,Elif Özge Yılmaz,Çağrı Toraman,Bilge Kaan Görür

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)

关键词:English benchmarks, low-resource languages remains, remains largely unexplored, achieved strong performance, languages remains largely

备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.

29. 【2602.03619】Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

链接https://arxiv.org/abs/2602.03619

作者:Changze Lv,Jie Zhou,Wentao Zhao,Jingwen Xu,Zisu Huang,Muzhao Tian,Shihan Dou,Tao Gui,Le Tian,Xiao Zhou,Xiaoqing Zheng,Xuanjing Huang,Jie Zhou

类目:Computation and Language (cs.CL)

关键词:remain challenging due, verifiable reward signals, evaluating DeepResearch-generated reports, DeepResearch-generated reports remain, reports remain challenging

备注

点击查看摘要

Abstract:Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.

30. 【2602.03608】Controlling Output Rankings in Generative Engines for LLM-based Search

链接https://arxiv.org/abs/2602.03608

作者:Haibo Jin,Ruoxi Chen,Peiyan Zhang,Yifeng Luo,Huimin Zeng,Man Luo,Haohan Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:large language models, textbf, rise of large, search, CORE

备注: 23 pages

点击查看摘要

Abstract:The way customers search for and choose products is changing with the rise of large language models (LLMs). LLM-based search, or generative engines, provides direct product recommendations to users, rather than traditional online search results that require users to explore options themselves. However, these recommendations are strongly influenced by the initial retrieval order of LLMs, which disadvantages small businesses and independent creators by limiting their visibility. In this work, we propose CORE, an optimization method that \textbf{C}ontrols \textbf{O}utput \textbf{R}ankings in g\textbf{E}nerative Engines for LLM-based search. Since the LLM's interactions with the search engine are black-box, CORE targets the content returned by search engines as the primary means of influencing output rankings. Specifically, CORE optimizes retrieved content by appending strategically designed optimization content to steer the ranking of outputs. We introduce three types of optimization content: string-based, reasoning-based, and review-based, demonstrating their effectiveness in shaping output rankings. To evaluate CORE in realistic settings, we introduce ProductBench, a large-scale benchmark with 15 product categories and 200 products per category, where each product is associated with its top-10 recommendations collected from Amazon's search interface. Extensive experiments on four LLMs with search capabilities (GPT-4o, Gemini-2.5, Claude-4, and Grok-3) demonstrate that CORE achieves an average Promotion Success Rate of \textbf{91.4\% @Top-5}, \textbf{86.6\% @Top-3}, and \textbf{80.3\% @Top-1}, across 15 product categories, outperforming existing ranking manipulation methods while preserving the fluency of optimized content.

Comments:
23 pages

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.03608 [cs.CL]

(or
arXiv:2602.03608v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.03608

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
31. 【2602.03588】Efficient Algorithms for Partial Constraint Satisfaction Problems over Control-flow Graphs

链接https://arxiv.org/abs/2602.03588

作者:Xuran Cai,Amir Goharshady

类目:Computation and Language (cs.CL); Programming Languages (cs.PL)

关键词:Constraint Satisfaction Problem, Satisfaction Problem, Partial Constraint Satisfaction, Constraint Satisfaction, well-known Constraint Satisfaction

备注: Already accepted by SETTA'25. [this https URL](https://www.setta2025.uk/accepted-papers) . arXiv admin note: substantial text overlap with [arXiv:2507.16660](https://arxiv.org/abs/2507.16660)

点击查看摘要

Abstract:In this work, we focus on the Partial Constraint Satisfaction Problem (PCSP) over control-flow graphs (CFGs) of programs. PCSP serves as a generalization of the well-known Constraint Satisfaction Problem (CSP). In the CSP framework, we define a set of variables, a set of constraints, and a finite domain $D$ that encompasses all possible values for each variable. The objective is to assign a value to each variable in such a way that all constraints are satisfied. In the graph variant of CSP, an underlying graph is considered and we have one variable corresponding to each vertex of the graph and one or several constraints corresponding to each edge. In PCSPs, we allow for certain constraints to be violated at a specified cost, aiming to find a solution that minimizes the total cost. Numerous classical compiler optimization tasks can be framed as PCSPs over control-flow graphs. Examples include Register Allocation, Lifetime-optimal Speculative Partial Redundancy Elimination (LOSPRE), and Optimal Placement of Bank Selection Instructions. On the other hand, it is well-known that control-flow graphs of structured programs are sparse and decomposable in a variety of ways. In this work, we rely on the Series-Parallel-Loop (SPL) decompositions as introduced by~\cite{RegisterAllocation}. Our main contribution is a general algorithm for PCSPs over SPL graphs with a time complexity of \(O(|G| \cdot |D|^6)\), where \(|G|\) represents the size of the control-flow graph. Note that for any fixed domain $D,$ this yields a linear-time solution. Our algorithm can be seen as a generalization and unification of previous SPL-based approaches for register allocation and LOSPRE. In addition, we provide experimental results over another classical PCSP task, i.e. Optimal Bank Selection, achieving runtimes four times better than the previous state of the art.

32. 【2602.03587】CL-bench: A Benchmark for Context Learning

链接https://arxiv.org/abs/2602.03587

作者:Shihan Dou,Ming Zhang,Zhangyue Yin,Chenhao Huang,Yujiong Shen,Junzhe Wang,Jiayi Chen,Yuchen Ni,Junjie Ye,Cheng Zhang,Huaibing Xie,Jianglu Hu,Shaolei Wang,Weichao Wang,Yanling Xiao,Yiting Liu,Zenan Xu,Zhen Guo,Pluto Zhou,Tao Gui,Zuxuan Wu,Xipeng Qiu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Di Wang,Shunyu Yao

类目:Computation and Language (cs.CL)

关键词:Current language models, Current language, excel at reasoning, reasoning over prompts, prompts using pre-trained

备注: 78 pages, 17 figures

点击查看摘要

Abstract:Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.

33. 【2602.03584】$V_0$: A Generalist Value Model for Any Policy at State Zero

链接https://arxiv.org/abs/2602.03584

作者:Yi-Kai Zhang,Zhiyuan Yao,Hongyan Hao,Yueqing Sun,Qi Gu,Hui Su,Xunliang Cai,De-Chuan Zhan,Han-Jia Ye

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:gradient methods rely, model reinforces behaviors, Policy gradient methods, Large Language Models, Relative Policy Optimization

备注

点击查看摘要

Abstract:Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.

34. 【2602.03578】Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs

链接https://arxiv.org/abs/2602.03578

作者:Su Dong,Qinggang Zhang,Yilin Xiao,Shengyuan Chen,Chuang Zhou,Xiao Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, knowledge-intensive tasks due, Large language, outdated parametric knowledge, language models

备注

点击查看摘要

Abstract:Large language models (LLMs) often struggle with knowledge-intensive tasks due to hallucinations and outdated parametric knowledge. While Retrieval-Augmented Generation (RAG) addresses this by integrating external corpora, its effectiveness is limited by fragmented information in unstructured domain documents. Graph-augmented RAG (GraphRAG) emerged to enhance contextual reasoning through structured knowledge graphs, yet paradoxically underperforms vanilla RAG in real-world scenarios, exhibiting significant accuracy drops and prohibitive latency despite gains on complex queries. We identify the rigid application of GraphRAG to all queries, regardless of complexity, as the root cause. To resolve this, we propose an efficient and adaptive GraphRAG framework called EA-GraphRAG that dynamically integrates RAG and GraphRAG paradigms through syntax-aware complexity analysis. Our approach introduces: (i) a syntactic feature constructor that parses each query and extracts a set of structural features; (ii) a lightweight complexity scorer that maps these features to a continuous complexity score; and (iii) a score-driven routing policy that selects dense RAG for low-score queries, invokes graph-based retrieval for high-score queries, and applies complexity-aware reciprocal rank fusion to handle borderline cases. Extensive experiments on a comprehensive benchmark, consisting of two single-hop and two multi-hop QA benchmarks, demonstrate that our EA-GraphRAG significantly improves accuracy, reduces latency, and achieves state-of-the-art performance in handling mixed scenarios involving both simple and complex queries.

35. 【2602.03563】ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning

链接https://arxiv.org/abs/2602.03563

作者:Wei Zhu

类目:Computation and Language (cs.CL)

关键词:supervised setting, contrastive learning, contrastive learning objective, supervised settings, success in self-supervised

备注

点击查看摘要

Abstract:Despite its success in self-supervised learning, contrastive learning is less studied in the supervised setting. In this work, we first use a set of pilot experiments to show that in the supervised setting, the cross-entropy loss objective (CE) and the contrastive learning objective often conflict with each other, thus hindering the applications of CL in supervised settings. To resolve this problem, we introduce a novel \underline{A}ligned \underline{C}ontrastive \underline{L}earning (ACL) framework. First, ACL-Embed regards label embeddings as extra augmented samples with different labels and employs contrastive learning to align the label embeddings with its samples' representations. Second, to facilitate the optimization of ACL-Embed objective combined with the CE loss, we propose ACL-Grad, which will discard the ACL-Embed term if the two objectives are in conflict. To further enhance the performances of intermediate exits of multi-exit BERT, we further propose cross-layer ACL (ACL-CL), which is to ask the teacher exit to guide the optimization of student shallow exits. Extensive experiments on the GLUE benchmark results in the following takeaways: (a) ACL-BRT outperforms or performs comparably with CE and CE+SCL on the GLUE tasks; (b) ACL, especially CL-ACL, significantly surpasses the baseline methods on the fine-tuning of multi-exit BERT, thus providing better quality-speed tradeoffs for low-latency applications.

36. 【2602.03560】HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

链接https://arxiv.org/abs/2602.03560

作者:Yizhao Gao,Jianyu Wei,Qihao Zhang,Yu Cheng,Shimao Chen,Zhengju Tang,Zihan Jiang,Yifan Song,Hailin Zhang,Liang Zhao,Bo Yang,Gang Wang,Shijie Cao,Fuli Luo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Sparse Attention, full attention, Attention, full attention layer, work introduces Hybrid

备注: 17 pages, 2 figures

点击查看摘要

Abstract:This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.

37. 【2602.03554】When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

链接https://arxiv.org/abs/2602.03554

作者:Bogdan Zagribelnyy,Ivan Ilin,Maksim Kuznetsov,Nikita Bondarev,Roman Schutski,Thomas MacDougall,Rim Shayakhmetov,Zulfat Miftakhutdinov,Mikolaj Mizera,Vladimir Aladinskiy,Alex Aliper,Alex Zhavoronkov

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

关键词:Recent progress, drug discovery, large language models, including synthesis planning, progress has expanded

备注

点击查看摘要

Abstract:Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

38. 【2602.03551】Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

链接https://arxiv.org/abs/2602.03551

作者:Vitalii Hirak,Jaap Jumelet,Arianna Bisazza

类目:Computation and Language (cs.CL)

关键词:quality disparities persist, major advances, disparities persist, large quality disparities, typological properties

备注: 19 pages, 11 figures, EACL 2026

点击查看摘要

Abstract:Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.

39. 【2602.03548】SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue

链接https://arxiv.org/abs/2602.03548

作者:Yuqin Dai,Ning Gao,Wei Zhang,Jie Wang,Zichen Luo,Jinpeng Wang,Yujie Wang,Ruiyuan Wu,Chaozheng Wang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, demonstrated remarkable

备注

点击查看摘要

Abstract:Large Language Models have demonstrated remarkable capabilities in open-domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal-oriented user behaviors. To address these issues, we propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open-source Foundation Models and Closed-source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: this https URL.

40. 【2602.03542】Can Large Language Models Generalize Procedures Across Representations?

链接https://arxiv.org/abs/2602.03542

作者:Fangru Lin,Valentin Hofmann,Xingchen Wan,Weixing Wang,Zifeng Ding,Anthony G. Cohn,Janet B. Pierrehumbert

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language, Large language models, real-world user tasks, natural language, tested extensively

备注

点击查看摘要

Abstract:Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.

41. 【2602.03507】Learning to Reason Faithfully through Step-Level Faithfulness Maximization

链接https://arxiv.org/abs/2602.03507

作者:Runquan Gui,Yafu Li,Xiaoye Qu,Ziyan Liu,Yeqiu Cheng,Yu Cheng

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, tasks requiring multi-step, requiring multi-step reasoning

备注

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at this https URL.

42. 【2602.03491】Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

链接https://arxiv.org/abs/2602.03491

作者:Yingjie Zhu,Xuefeng Bai,Kehai Chen,Yang Xiang,Youcheng Pan,Xiaoqiang Zhou,Min Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large Vision-Language Models, images remains challenging, Vision-Language Models, challenging for Large, Large Vision-Language

备注

点击查看摘要

Abstract:Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to tables structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLM's table understanding and reasoning capabilities, particularly generalizing to unseen table structures.

43. 【2602.03485】Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

链接https://arxiv.org/abs/2602.03485

作者:Quanyu Long,Kai Jie Jiang,Jianda Chen,Xu Guo,Leilei Gan,Wenya Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:achieve strong performance, Large Reasoning Models, generating long reasoning, long reasoning traces, Large Reasoning

备注: 19 pages, 8 figures

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consist of self-verification (recheck) that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors and altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces the overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates whether a recheck is likely unnecessary via efficient retrieval. When historical experience suggests unnecessary, a suppression signal redirects the model to proceed. Across multiple model and benchmarks, our approach reduces token usage up to 20.3% while maintaining the accuracy, and in some datasets even yields accuracy improvements.

44. 【2602.03484】Preferences for Idiomatic Language are Acquired Slowly -- and Forgotten Quickly: A Case Study on Swedish

链接https://arxiv.org/abs/2602.03484

作者:Jenny Kunz

类目:Computation and Language (cs.CL)

关键词:linguistically acceptable, textit, Swedish, models develop preferences, develop preferences

备注: Accepted to TACL. Note that the arXiv version is a pre-MIT Press publication version

点击查看摘要

Abstract:In this study, we investigate how language models develop preferences for \textit{idiomatic} as compared to \textit{linguistically acceptable} Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English -- the common approach for languages with little or no native instruction data -- causes models to rapidly lose their preference for idiomatic language.

45. 【2602.03442】A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

链接https://arxiv.org/abs/2602.03442

作者:Mingxuan Du,Benfeng Xu,Chiwei Zhu,Shaohan Wang,Pengyu Wang,Xiaorui Wang,Zhendong Mao

类目:Computation and Language (cs.CL)

关键词:Frontier language models, demonstrated strong reasoning, Frontier language, long-horizon tool-use capabilities, demonstrated strong

备注: 18 pages, 8 figures

点击查看摘要

Abstract:Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model's input, or (2) predefining a workflow and prompting the model to execute it step-by-step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. We will release our code and evaluation suite to facilitate future research. Code and evaluation suite are available at this https URL.

46. 【2602.03429】DiscoverLLM: From Executing Intents to Discovering Them

链接https://arxiv.org/abs/2602.03429

作者:Tae Soo Kim,Yoonjoo Lee,Jaesang Yu,John Joon Young Chung,Juho Kim

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Language Models, open-ended requests, clarification questions

备注

点击查看摘要

Abstract:To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options -- where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.

47. 【2602.03419】SWE-World: Building Software Engineering Agents in Docker-Free Environments

链接https://arxiv.org/abs/2602.03419

作者:Shuang Sun,Huatong Song,Lisheng Huang,Jinhao Jiang,Ran Le,Zhihao Lv,Zongchao Chen,Yiwen Hu,Wenyang Luo,Wayne Xin Zhao,Yang Song,Hongteng Xu,Tao Zhang,Ji-Rong Wen

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:Recent advances, large language models, complex code modification, tackle complex code, advances in large

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled software engineering agents to tackle complex code modification tasks. Most existing approaches rely on execution feedback from containerized environments, which require dependency-complete setup and physical execution of programs and tests. While effective, this paradigm is resource-intensive and difficult to maintain, substantially complicating agent training and limiting scalability. We propose SWE-World, a Docker-free framework that replaces physical execution environments with a learned surrogate for training and evaluating software engineering agents. SWE-World leverages LLM-based models trained on real agent-environment interaction data to predict intermediate execution outcomes and final test feedback, enabling agents to learn without interacting with physical containerized environments. This design preserves the standard agent-environment interaction loop while eliminating the need for costly environment construction and maintenance during agent optimization and evaluation. Furthermore, because SWE-World can simulate the final evaluation outcomes of candidate trajectories without real submission, it enables selecting the best solution among multiple test-time attempts, thereby facilitating effective test-time scaling (TTS) in software engineering tasks. Experiments on SWE-bench Verified demonstrate that SWE-World raises Qwen2.5-Coder-32B from 6.2\% to 52.0\% via Docker-free SFT, 55.0\% with Docker-free RL, and 68.2\% with further TTS. The code is available at this https URL

48. 【2602.03417】FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

链接https://arxiv.org/abs/2602.03417

作者:Yingli Shen,Wen Lai,Jie Zhou,Xueren Zhang,Yudong Wang,Kangyang Luo,Shuo Wang,Ge Gao,Alexander Fraser,Maosong Sun

类目:Computation and Language (cs.CL)

关键词:exhibit remarkable fluency, LLMs exhibit remarkable, remarkable fluency, traceable provenance, LLMs exhibit

备注

点击查看摘要

Abstract:While LLMs exhibit remarkable fluency, their utility is often compromised by factual hallucinations and a lack of traceable provenance. Existing resources for grounding mitigate this but typically enforce a dichotomy: they offer either structured knowledge without textual context (e.g., knowledge bases) or grounded text with limited scale and linguistic coverage. To bridge this gap, we introduce FactNet, a massive, open-source resource designed to unify 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers derived exclusively from 316 Wikipedia editions. Unlike recent synthetic approaches, FactNet employs a strictly deterministic construction pipeline, ensuring that every evidence unit is recoverable with byte-level precision. Extensive auditing confirms a high grounding precision of 92.1%, even in long-tail languages. Furthermore, we establish FactNet-Bench, a comprehensive evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking. FactNet provides the community with a foundational, reproducible resource for training and evaluating trustworthy, verifiable multilingual systems.

49. 【2602.03412】Verified Critical Step Optimization for LLM Agents

链接https://arxiv.org/abs/2602.03412

作者:Mukai Li,Qingcheng Zeng,Tianqing Fang,Zhenwen Liang,Linfeng Song,Qi Liu,Haitao Mi,Dong Yu

类目:Computation and Language (cs.CL)

关键词:tackle increasingly complex, increasingly complex long-horizon, complex long-horizon tasks, Monte Carlo sampling, large language model

备注: Working in progress

点击查看摘要

Abstract:As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification-based learning for agent post-training.

50. 【2602.03411】SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training

链接https://arxiv.org/abs/2602.03411

作者:Huatong Song,Lisheng Huang,Shuang Sun,Jinhao Jiang,Ran Le,Daixuan Cheng,Guoxin Chen,Yiwen Hu,Zongchao Chen,Wayne Xin Zhao,Yang Song,Tao Zhang,Ji-Rong Wen

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:building effective software, fully reproducible post-training, technical report, building effective, software engineering agents

备注

点击查看摘要

Abstract:In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents. SWE-Master systematically explores the complete agent development pipeline, including teacher-trajectory synthesis and data curation, long-horizon SFT, RL with real execution feedback, and inference framework design. Starting from an open-source base model with limited initial SWE capability, SWE-Master demonstrates how systematical optimization method can elicit strong long-horizon SWE task solving abilities. We evaluate SWE-Master on SWE-bench Verified, a standard benchmark for realistic software engineering tasks. Under identical experimental settings, our approach achieves a resolve rate of 61.4\% with Qwen2.5-Coder-32B, substantially outperforming existing open-source baselines. By further incorporating test-time scaling~(TTS) with LLM-based environment feedback, SWE-Master reaches 70.8\% at TTS@8, demonstrating a strong performance potential. SWE-Master provides a practical and transparent foundation for advancing reproducible research on software engineering agents. The code is available at this https URL.

51. 【2602.03396】owards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

链接https://arxiv.org/abs/2602.03396

作者:Hao Fang,Tianyi Zhang,Tianqu Zhuang,Jiawei Kong,Kuofeng Gao,Bin Chen,Leqi Liang,Shu-Tao Xia,Ke Xu

类目:Computation and Language (cs.CL)

关键词:Proprietary large language, embody substantial economic, Proprietary large, large language models, embody substantial

备注

点击查看摘要

Abstract:Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.

52. 【2602.03368】Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain

链接https://arxiv.org/abs/2602.03368

作者:Wei Zhu

类目:Computation and Language (cs.CL)

关键词:industrial applications based, retrieval augmented generation, large language models, industrial applications, RAG system

备注

点击查看摘要

Abstract:While retrieval augmented generation (RAG) has been swiftly adopted in industrial applications based on large language models (LLMs), there is no consensus on what are the best practices for building a RAG system in terms of what are the components, how to organize these components and how to implement each component for the industrial applications, especially in the medical domain. In this work, we first carefully analyze each component of the RAG system and propose practical alternatives for each component. Then, we conduct systematic evaluations on three types of tasks, revealing the best practices for improving the RAG system and how LLM-based RAG systems make trade-offs between performance and efficiency.

53. 【2602.03359】MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

链接https://arxiv.org/abs/2602.03359

作者:Ning Ding,Fangcheng Liu,Kyungrae Kim,Linji Hao,Kyeng-Hun Lee,Hyeonmok Ko,Yehui Tang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Scaling Large Language, Large Language, Language Models, Scaling Large

备注

点击查看摘要

Abstract:Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test-time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLM on edge devices such as smartphone remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that injects pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of memory-based scaling paradigm for on-device LLMs. Project homepage is at this https URL.

54. 【2602.03358】GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer

链接https://arxiv.org/abs/2602.03358

作者:Junmo Cho,Suhan Kim,Sangjune An,Minsu Kim,Dong Bok Lee,Heejun Lee,Sung Ju Hwang,Hae Beom Lee

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Finding effective prompts, Finding effective, expensive target-LM evaluation, language models, notoriously difficult

备注

点击查看摘要

Abstract:Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, rewards are sparse due to expensive target-LM evaluation. Yet, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.

55. 【2602.03352】PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

链接https://arxiv.org/abs/2602.03352

作者:Yunzhi Shen,Hao Zhou,Xin Huang,Xue Han,Junlan Feng,Shujian Huang

类目:Computation and Language (cs.CL)

关键词:Monte Carlo return, GRPO demonstrating notable, Monte Carlo, noisy learning signals, learning signals arising

备注

点击查看摘要

Abstract:Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce \textbf{PEGRL}, a \textit{two-stage} RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English$\to$Finnish, English$\to$Turkish, and English$\leftrightarrow$Chinese show consistent gains over RL baselines, and for English$\to$Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).

56. 【2602.03344】Robustness as an Emergent Property of Task Performance

链接https://arxiv.org/abs/2602.03344

作者:Shir Ashury-Tahan,Ariel Gera,Elron Bandel,Michal Shmueli-Scheuer,Leshem Choshen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:critical future challenge, stability is essential, critical future, future challenge, Robustness

备注

点击查看摘要

Abstract:Robustness is often regarded as a critical future challenge for real-world applications, where stability is essential. However, as models often learn tasks in a similar order, we hypothesize that easier tasks will be easier regardless of how they are presented to the model. Indeed, in this paper, we show that as models approach high performance on a task, robustness is effectively achieved. Through an empirical analysis of multiple models across diverse datasets and configurations (e.g., paraphrases, different temperatures), we find a strong positive correlation. Moreover, we find that robustness is primarily driven by task-specific competence rather than inherent model-level properties, challenging current approaches that treat robustness as an independent capability. Thus, from a high-level perspective, we may expect that as new tasks saturate, model robustness on these tasks will emerge accordingly. For researchers, this implies that explicit efforts to measure and improve robustness may warrant reduced emphasis, as such robustness is likely to develop alongside performance gains. For practitioners, it acts as a sign that indeed the tasks that the literature deals with are unreliable, but on easier past tasks, the models are reliable and ready for real-world deployment.

57. 【2602.03338】Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

链接https://arxiv.org/abs/2602.03338

作者:Rakshith Vasudev,Melisa Russak,Dan Bikel,Waseem Alshikh

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:LLM critic models, LLM critic, LLM critic accuracy, binary LLM critic, Proactive interventions

备注

点击查看摘要

Abstract:Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.

Subjects:

Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.03338 [cs.CL]

(or
arXiv:2602.03338v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.03338

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
58. 【2602.03318】MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

链接https://arxiv.org/abs/2602.03318

作者:Yifan Shi,Jialong Shi,Jiayi Wang,Ye Fan,Jianyong Sun

类目:Computation and Language (cs.CL)

关键词:Operations Research, expert-driven modeling-a slow, fragile process ill-suited, relies on expert-driven, expert-driven modeling-a

备注

点击查看摘要

Abstract:Operations Research (OR) relies on expert-driven modeling-a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.

59. 【2602.03300】R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

链接https://arxiv.org/abs/2602.03300

作者:Jingyi Zhang,Tianyi Lin,Huanjin Yao,Xiang Lan,Shunyu Liu,Jiaxing Huang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Collective Adversarial Data, complex real-world tasks, solving complex real-world, effective data synthesis, data synthesis techniques

备注

点击查看摘要

Abstract:In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

60. 【2602.03295】POP: Prefill-Only Pruning for Efficient Large Model Inference

链接https://arxiv.org/abs/2602.03295

作者:Junhui He,Zhihui Fu,Jun Wang,Qingan Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, remarkable capabilities

备注

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.

61. 【2602.03237】Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations

链接https://arxiv.org/abs/2602.03237

作者:Yuxuan Yao,Haonan Sheng,Qingsong Lv,Han Wu,Shuqi Liu,Zehua Liu,Zengyan Liu,Jiahui Gao,Haochen Tan,Xiaojin Fu,Haoli Bai,Hing Cheung So,Zhijiang Guo,Linqi Song

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, ARM, textbf

备注

点击查看摘要

Abstract:The escalating scale of Large Language Models (LLMs) necessitates efficient adaptation techniques. Model merging has gained prominence for its efficiency and controllability. However, existing merging techniques typically serve as post-hoc refinements or focus on mitigating task interference, often failing to capture the dynamic optimization benefits of supervised fine-tuning (SFT). In this work, we propose Streaming Merging, an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. Central to this paradigm is \textbf{ARM} (\textbf{A}ctivation-guided \textbf{R}otation-aware \textbf{M}erging), a strategy designed to approximate gradient descent dynamics. By treating merging coefficients as learning rates and deriving rotation vectors from activation subspaces, ARM effectively steers parameter updates along data-driven trajectories. Unlike conventional linear interpolation, ARM aligns semantic subspaces to preserve the geometric structure of high-dimensional parameter evolution. Remarkably, ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model. Experimental results across model scales (1.7B to 14B) and diverse domains (e.g., math, code) demonstrate that ARM can transcend converged checkpoints. Extensive experiments show that ARM provides a scalable and lightweight framework for efficient model adaptation.

62. 【2602.03226】ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs

链接https://arxiv.org/abs/2602.03226

作者:Xuancheng Li,Haitao Li,Yujia Zhou,Qingyao Ai,Yiqun Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, language models, large language, due to excessive, critical information

备注

点击查看摘要

Abstract:Long-context inputs in large language models (LLMs) often suffer from the "lost in the middle" problem, where critical information becomes diluted or ignored due to excessive length. Context compression methods aim to address this by reducing input size, but existing approaches struggle with balancing information preservation and compression efficiency. We propose Adaptive Task-Aware Compressor (ATACompressor), which dynamically adjusts compression based on the specific requirements of the task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while reducing unnecessary content. Its adaptive allocation controller perceives the length of relevant content and adjusts the compression rate accordingly, optimizing resource utilization. We evaluate ATACompressor on three QA datasets: HotpotQA, MSMARCO, and SQUAD-showing that it outperforms existing methods in terms of both compression efficiency and task performance. Our approach provides a scalable solution for long-context processing in LLMs. Furthermore, we perform a range of ablation studies and analysis experiments to gain deeper insights into the key components of ATACompressor.

63. 【2602.03216】oken Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

链接https://arxiv.org/abs/2602.03216

作者:Dongwon Jo,Beomseok Kang,Jiwon Song,Jae-Joon Kim

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large language models, Token Sparse Attention, Sparse Attention, Token Sparse, attention

备注

点击查看摘要

Abstract:The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves accuracy-latency trade-off, achieving up to $\times$3.23 attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.

64. 【2602.03203】ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

链接https://arxiv.org/abs/2602.03203

作者:Zican Dong,Peiyu Liu,Junyi Li,Zhipeng Chen,Han Peng,Shuo Wang,Wayne Xin Zhao

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:shown remarkable reasoning, remarkable reasoning abilities, producing long reasoning, shown remarkable, abilities by producing

备注

点击查看摘要

Abstract:Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.

65. 【2602.03190】Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

链接https://arxiv.org/abs/2602.03190

作者:Wenquan Lu,Hai Huang,Randall Balestriero

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:demonstrated strong potential, Reinforcement learning algorithms, group-relative policy optimization, large language models, learning algorithms

备注

点击查看摘要

Abstract:Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 44.5 per-benchmark accuracy and 51.3 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at this https URL.

66. 【2602.03184】DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference

链接https://arxiv.org/abs/2602.03184

作者:Jiancai Ye,Jun Liu,Qingchen Li,Tianlang Zhao,Hanbin Zhang,Jiayi Pan,Ningyi Xu,Guohao Dai

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:large language models, efficient large language, Cache is essential, language models, growing memory footprint

备注

点击查看摘要

Abstract:Although Key-Value (KV) Cache is essential for efficient large language models (LLMs) inference, its growing memory footprint in long-context scenarios poses a significant bottleneck, making KVCache compression crucial. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. We observe that rigid splitting suffers from significant accuracy degradation (ranging from 5.5% to 55.1%) across different scenarios, owing to the scenario-dependent nature of the semantic boundaries. This highlights the necessity of dynamic semantic splitting to match semantics. To achieve this, we face two challenges. (1) Improper delimiter selection misaligns semantics with the KVCache, resulting in 28.6% accuracy loss. (2) Variable-length blocks after splitting introduce over 73.1% additional inference overhead. To address the above challenges, we propose DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. We propose: (1) a dynamic importance-aware delimiter selection strategy, improving accuracy by 49.9%. (2) A uniform mapping strategy that transforms variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves the highest accuracy, 2.2x speedup compared with FlashAttention and 2.6x peak memory reduction in long-context scenarios.

67. 【2602.03183】Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

链接https://arxiv.org/abs/2602.03183

作者:Hyunwoo Kim,Niloofar Mireshghallah,Michael Duan,Rui Xin,Shuyue Stella Li,Jaehun Jung,David Acuna,Qi Pang,Hanshen Xiao,G. Edward Suh,Sewoong Oh,Yulia Tsvetkov,Pang Wei Koh,Yejin Choi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:standing in sharp, sharp contrast, Research involving privacy-sensitive, involving privacy-sensitive data, Research involving

备注: For code and data, see [this https URL](https://privasis.github.io)

点击查看摘要

Abstract:Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.

68. 【2602.03160】VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

链接https://arxiv.org/abs/2602.03160

作者:Woojin Kim,Sieun Hyeon,Jusang Oh,Jaeyoung Do

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Aligning Large Language, Large Language Models, Aligning Large, Large Language, deeper motivational principles

备注

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

69. 【2602.03152】FASA: Frequency-aware Sparse Attention

链接https://arxiv.org/abs/2602.03152

作者:Yifei Wang,Yueqi Wang,Zhenrui Yue,Huimin Zeng,Yong Wang,Ismini Lourentzou,Zhengzhong Tu,Xiangxiang Chu,Julian McAuley

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, handling lengthy inputs, deployment of Large

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. %making them a powerful and efficient proxy for token importance. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. % Since accessing only a small fraction of the KV cache, FASA drastically lowers memory bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constraint budgets. Notably, on LongBench-V1, FASA reaches nearly 100\% of full-KV performance when only keeping 256 tokens, and achieves 2.56$\times$ speedup using just 18.9\% of the cache on AIME24.

70. 【2602.03143】Self-Hinting Language Models Enhance Reinforcement Learning

链接https://arxiv.org/abs/2602.03143

作者:Baohao Liao,Hanze Dong,Xinxing Xu,Christof Monz,Jiang Bian

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:Relative Policy Optimization, aligning large language, Group Relative Policy, large language models, Policy Optimization

备注

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at this https URL.

71. 【2602.03141】Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

链接https://arxiv.org/abs/2602.03141

作者:Runquan Gui,Jie Wang,Zhihai Wang,Chi Ma,Jianye Hao,Feng Wu

类目:Computation and Language (cs.CL)

关键词:verbose generation results, demonstrated impressive capabilities, solving complex tasks, Large Reasoning Models, textbf

备注

点击查看摘要

Abstract:While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose \textbf{CoSMo} (\textbf{Co}nsistency-Guided \textbf{S}plit-\textbf{M}erge \textbf{O}ptimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by \textbf{3.3} points while reducing segment usage by \textbf{28.7\%} on average compared to reasoning efficiency baselines.

72. 【2602.03109】One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

链接https://arxiv.org/abs/2602.03109

作者:Bowen Jiang,Taiwei Shi,Ryo Kamoi,Yuan Yuan,Camillo J. Taylor,Longqi Yang,Pei Zhou,Sihao Chen

类目:Computation and Language (cs.CL)

关键词:multi-agent conversational self-play, paper introduces OMAR, reinforcement learning framework, introduces OMAR, multi-agent conversational

备注

点击查看摘要

Abstract:This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.

73. 【2602.03108】ChemPro: A Progressive Chemistry Benchmark for Large Language Models

链接https://arxiv.org/abs/2602.03108

作者:Aaditya Baranwal,Shruti Vyas

类目:Computation and Language (cs.CL)

关键词:Large Language Models, natural language question-answer, language question-answer pairs, proficiency of Large, Multiple Choice Questions

备注

点击查看摘要

Abstract:We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student's academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.03108 [cs.CL]

(or
arXiv:2602.03108v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.03108

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
74. 【2602.03107】he Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models

链接https://arxiv.org/abs/2602.03107

作者:Yitong Zhang,Yuhan Xiang,Mingxuan Liu

类目:Computation and Language (cs.CL)

关键词:mock politeness phenomena, large language models, representative large language, study systematically evaluates, mock politeness

备注: Preprint

点击查看摘要

Abstract:From a pragmatic perspective, this study systematically evaluates the differences in performance among representative large language models (LLMs) in recognizing politeness, impoliteness, and mock politeness phenomena in Chinese. Addressing the existing gaps in pragmatic comprehension, the research adopts the frameworks of Rapport Management Theory and the Model of Mock Politeness to construct a three-category dataset combining authentic and simulated Chinese discourse. Six representative models, including GPT-5.1 and DeepSeek, were selected as test subjects and evaluated under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid strategies. This study serves as a meaningful attempt within the paradigm of ``Great Linguistics,'' offering a novel approach to applying pragmatic theory in the age of technological transformation. It also responds to the contemporary question of how technology and the humanities may coexist, representing an interdisciplinary endeavor that bridges linguistic technology and humanistic reflection.

75. 【2602.03103】ask--Specificity Score: Measuring How Much Instructions Really Matter for Supervision

链接https://arxiv.org/abs/2602.03103

作者:Pritam Kadasi,Abhishek Upperwal,Mayank Singh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, adapt large language, train and adapt, adapt large, Specificity Score

备注

点击查看摘要

Abstract:Instruction tuning is now the default way to train and adapt large language models, but many instruction--input--output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: \emph{does the instruction uniquely determine the target output?} We propose the \textbf{Task--Specificity Score (TSS)} to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce \textbf{TSS++}, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (\textsc{Alpaca}, \textsc{Dolly-15k}, \textsc{NI-20}) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.03103 [cs.CL]

(or
arXiv:2602.03103v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.03103

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
76. 【2602.03094】st-time Recursive Thinking: Self-Improvement without External Feedback

链接https://arxiv.org/abs/2602.03094

作者:Yufan Zhuang,Chandan Singh,Liyuan Liu,Yelong Shen,Dinghuai Zhang,Jingbo Shang,Jianfeng Gao,Weizhu Chen

类目:Computation and Language (cs.CL)

关键词:Modern Large Language, Large Language Models, Modern Large, Large Language, shown rapid improvements

备注

点击查看摘要

Abstract:Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without the need for additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.

77. 【2602.03084】AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback

链接https://arxiv.org/abs/2602.03084

作者:Zhitao Gao,Jie Ma,Xuhong Li,Pengyu Li,Ning Qu,Yaqiang Wu,Hui Liu,Jun Liu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, achieved significant success, external verifiers

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose \underline{A}utonomous \underline{E}volutionary \underline{R}easoning \underline{O}ptimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the \textit{Zone of Proximal Development (ZPD)} theory, AERO utilizes entropy-based positioning to target the ``solvability gap'' and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57\% on Qwen3-4B-Base and 5.10\% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at this https URL.

78. 【2602.03075】ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

链接https://arxiv.org/abs/2602.03075

作者:Junjie Huang,Jiarui Qin,Di Yin,Weiwen Liu,Yong Yu,Xing Sun,Weinan Zhang

类目:Computation and Language (cs.CL)

关键词:Standard training pipelines, large language models, Standard training, large language, Standard

备注: 25 pages

点击查看摘要

Abstract:Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process--where insights from post-training retroactively improve the pre-trained foundation--remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3\% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2\% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.

79. 【2602.03059】From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

链接https://arxiv.org/abs/2602.03059

作者:Yoonsang Kim,Divyansh Pradhan,Devshree Jadeja,Arie Kaufman

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)

关键词:converts verbal remote-assistance, framework that converts, spatially grounded, referent disambiguation framework, verbal remote-assistance instructions

备注: 11 pages, 6 figures. This is the author's version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR) 2026

点击查看摘要

Abstract:We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speechto-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

80. 【2602.03053】MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

链接https://arxiv.org/abs/2602.03053

作者:Vishal Venkataramani,Haizhou Shi,Zixuan Ke,Austin Xu,Xiaoxiao He,Yingbo Zhou,Semih Yavuz,Hao Wang,Shafiq Joty

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:Large Language Models, Large Language, built on Large, Language Models, MAS

备注: Preprint; work in progress

点击查看摘要

Abstract:Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at this https URL.

81. 【2602.03051】SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

链接https://arxiv.org/abs/2602.03051

作者:Xing Hu,Dawei Yang,Yuan Cheng,Zhixuan Chen,Zukang Xu

类目:Computation and Language (cs.CL)

关键词:large language models, efficient compression techniques, Error Suppression SVD, language models, rapid growth

备注

点击查看摘要

Abstract:The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques. As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline. To address this, we propose Self-Adaptive Error Suppression SVD (SAES-SVD), a LLMs compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation. SAES-SVD is composed of two novel components: (1) Cumulative Error-Aware Layer Compression (CEALC), which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution relied on second-order activation statistics, which explicitly aligns each layer's output with its full-precision counterpart to compensate for accumulated errors. (2) Adaptive Collaborative Error Suppression (ACES), which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively. Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or mixed-rank strategies, SAES-SVD consistently improves post-compression performance.

82. 【2602.03036】LatentMem: Customizing Latent Memory for Multi-Agent Systems

链接https://arxiv.org/abs/2602.03036

作者:Muxin Fu,Guibin Zhang,Xiangyuan Xue,Yafu Li,Zefeng He,Siyuan Huang,Xiaoye Qu,Yu Cheng,Yang Yang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:Large language model, demonstrate remarkable collective, remarkable collective intelligence, powered multi-agent systems, Large language

备注

点击查看摘要

Abstract:Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to $19.36$% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.

83. 【2602.03025】RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents

链接https://arxiv.org/abs/2602.03025

作者:Haitian Zhong,Jixiu Zhai,Lei Song,Jiang Bian,Qiang Liu,Tieniu Tan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, challenging for Large, Relative Policy Optimization, Multi-turn tool calling

备注

点击查看摘要

Abstract:Multi-turn tool calling is challenging for Large Language Models (LLMs) because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low (e.g., more rollouts in a group receive the all 0 or all 1 reward), making the group-normalized advantage uninformative and yielding vanishing updates. To address this problem, we propose RC-GRPO (Reward-Conditioned Group Relative Policy Optimization), which treats exploration as a controllable steering problem via discrete reward tokens. We first fine-tune a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward goal special tokens (e.g., |high_reward|, |low_reward|) injected into the prompts, enabling the model to learn how to generate distinct quality trajectories on demand. Then during RL, we sample diverse reward tokens within each GRPO group and condition rollouts on the sampled token to improve within-group diversity, improving advantage gains. On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, our method yields consistently improved performance than baselines, and the performance on Qwen-2.5-7B-Instruct even surpasses all closed-source API models.

84. 【2602.02979】CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

链接https://arxiv.org/abs/2602.02979

作者:Ran Li,Zeyuan Liu,Yinghao chen,Bingxiang He,Jiarui Yuan,Zixuan Fu,Weize Chen,Jinyi Hu,Zhiyuan Liu,Maosong Sun

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, demonstrated strong potential, progress remains fundamentally, remains fundamentally constrained

备注: work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.

85. 【2602.02975】Where Norms and References Collide: Evaluating LLMs on Normative Reasoning

链接https://arxiv.org/abs/2602.02975

作者:Mitchell Abrams,Kaveh Eskandari Miandoab,Felix Gervits,Vasanth Sarathy,Matthias Scheutz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, environments where successful, successful communication, communication often depends, constrain what actions

备注: Accepted to the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.

86. 【2602.02951】Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning

链接https://arxiv.org/abs/2602.02951

作者:Yihong Huang,Fei Ma,Yihua Shao,Jingcai Guo,Zitong Yu,Laizhong Cui,Qi Tian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Vision Language Model, Language Model, effective acceleration technique, efficient Vision Language, Vision Language

备注

点击查看摘要

Abstract:Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) and suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose $\text{Nüwa}$, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that $\text{Nüwa}$ achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).

87. 【2602.02932】Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness

链接https://arxiv.org/abs/2602.02932

作者:Alireza Amiri-Margavi,Arshia Gharagozlou,Amin Gholami Davodi,Seyed Pouyan Mousavi Davoudi,Hamidreza Hasani Balyani

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Prior work, large language models, safety filtering, large language, primarily focused

备注: 13 pages, 1 figure

点击查看摘要

Abstract:Prior work on fairness in large language models (LLMs) has primarily focused on access-level behaviors such as refusals and safety filtering. However, equitable access does not ensure equitable interaction quality once a response is provided. In this paper, we conduct a controlled fairness audit examining how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted. Using a counterfactual prompt design, we evaluate GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes along age, gender, and nationality. We assess access fairness through refusal analysis and measure interaction quality using automated linguistic metrics, including sentiment, politeness, and hedging. Identity-conditioned differences are evaluated using paired statistical tests. Both models exhibit zero refusal rates across all identities, indicating uniform access. Nevertheless, we observe systematic, model-specific disparities in interaction quality: GPT-4 expresses significantly higher hedging toward younger male users, while LLaMA exhibits broader sentiment variation across identity groups. These results show that fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits.

88. 【2602.02898】Aligning Language Model Benchmarks with Pairwise Preferences

链接https://arxiv.org/abs/2602.02898

作者:Marco Gutierrez,Xinyi Leng,Hannah Cyberey,Jonathan Richard Schwarz,Ahmed Alaa,Thomas Hartvigsen

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:pervasive and computationally-efficient, computationally-efficient proxies, proxies for real-world, benchmarks, models

备注

点击查看摘要

Abstract:Language model benchmarks are pervasive and computationally-efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weight- ings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.

89. 【2602.02891】raceNAS: Zero-shot LLM Pruning via Gradient Trace Correlation

链接https://arxiv.org/abs/2602.02891

作者:Prajna G. Malettira,Manish Nagaraj,Arjun Roy,Shubham Negi,Kaushik Roy

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, deployment of Large, Language Models, Structured pruning

备注: Preprint

点击查看摘要

Abstract:Structured pruning is essential for efficient deployment of Large Language Models (LLMs). The varying sensitivity of LLM sub-blocks to pruning necessitates the identification of optimal non-uniformly pruned models. Existing methods evaluate the importance of layers, attention heads, or weight channels in isolation. Such localized focus ignores the complex global structural dependencies that exist across the model. Training-aware structured pruning addresses global dependencies, but its computational cost can be just as expensive as post-pruning training. To alleviate the computational burden of training-aware pruning and capture global structural dependencies, we propose TraceNAS, a training-free Neural Architecture Search (NAS) framework that jointly explores structured pruning of LLM depth and width. TraceNAS identifies pruned models that maintain a high degree of loss landscape alignment with the pretrained model using a scale-invariant zero-shot proxy, effectively selecting models that exhibit maximal performance potential during post-pruning training. TraceNAS is highly efficient, enabling high-fidelity discovery of pruned models on a single GPU in 8.5 hours, yielding a 10$\times$ reduction in GPU-hours compared to training-aware methods. Evaluations on the Llama and Qwen families demonstrate that TraceNAS is competitive with training-aware baselines across commonsense and reasoning benchmarks.

90. 【2602.02888】HALT: Hallucination Assessment via Log-probs as Time series

链接https://arxiv.org/abs/2602.02888

作者:Ahmad Shapiro,Karan Taneja,Ashok Goel

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Time series, remain a major, major obstacle, large language models, HALT

备注

点击查看摘要

Abstract:Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). While being 30x smaller, HALT outperforms Lettuce, a fine-tuned modernBERT-base encoder, achieving a 60x speedup gain on HUB. HALT and HUB together establish an effective framework for hallucination detection across diverse LLM capabilities.

91. 【2602.02878】Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs

链接https://arxiv.org/abs/2602.02878

作者:Junyi Jessy Li,Yang Janet Liu,Kanishka Misra,Valentina Pyatkin,William Sheffield

类目:Computation and Language (cs.CL)

关键词:field of NLP, NLP has undergone, undergone vast, continuous transformations, past few years

备注: accepted to the TeachNLP 2026 workshop (co-located with EACL 2026), camera-ready, 14 pages

点击查看摘要

Abstract:The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This begs important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.

92. 【2602.02843】Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

链接https://arxiv.org/abs/2602.02843

作者:Polina Tsvilodub,Karl Mulligan,Todd Snider,Robert D. Hawkins,Michael Franke

类目:Computation and Language (cs.CL)

关键词:choose to act, act, uncertainty, reduce uncertainty, act to reduce

备注: 6 pages, 3 figures, under review

点击查看摘要

Abstract:When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty. In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.

93. 【2602.02842】Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing

链接https://arxiv.org/abs/2602.02842

作者:Saeid Sheikhi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Chain of Simulation, Large Language, present Chain, strategies in Large

备注

点击查看摘要

Abstract:We present Chain of Simulation (CoS), a novel dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies in Large Language Models (LLMs). Unlike existing uniform prompting approaches, CoS employs three distinct reasoning modes: (1) computational flow with self-consistency for mathematical problems, (2) symbolic state tracking with JSON representations for spatial reasoning, and (3) hybrid fact-extraction for multi-hop inference. Through comprehensive evaluation on GSM8K, StrategyQA, and bAbI benchmarks using four state-of-the-art models (Gemma-3 27B, LLaMA-3.1 8B, Mistral 7B, and Qwen-2.5 14B), we demonstrate that CoS achieves 71.5% accuracy on GSM8K (1.0% absolute improvement), 90.0% on StrategyQA (2.5% improvement), and 19.0% on bAbI (65.2% relative improvement) compared to the strongest baselines. The analysis reveals that problem-specific mode selection is crucial, with computational mode achieving 81.2% accuracy when correctly applied to mathematical problems, while misrouting leads to 0% accuracy. We provide detailed algorithms for mode selection, state tracking, and answer extraction, establishing CoS as an effective approach for improving LLM reasoning without additional training. The framework provides superior trade-offs between accuracy and efficiency compared to Self-Consistency, achieving comparable performance at 54% lower computational cost.

94. 【2602.02824】CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

链接https://arxiv.org/abs/2602.02824

作者:Zhengbang Yang,Yisheng Zhong,Junyuan Hong,Zhuangdi Zhu

类目:Computation and Language (cs.CL)

关键词:Pretrained knowledge memorized, LLMs raises critical, motivated LLM Unlearning, Negative Preference Alignment, motivated LLM

备注

点击查看摘要

Abstract:Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influences of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be either impractical or data and computationally prohibitive. Negative Preference Alignment has been explored for unlearning to tackle the limitations of GA, which, however, remains confined by its choice of reference model and shows undermined performance in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model's token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on MUSE and WMDP benchmarks demonstrated that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.

95. 【2602.02823】R2-Router: A New Paradigm for LLM Routing with Reasoning

链接https://arxiv.org/abs/2602.02823

作者:Jiaqi Xue,Qian Lou,Jiarong Xing,Heng Huang

类目:Computation and Language (cs.CL)

关键词:LLM, LLM quality, emerged by learning, learning to predict, quality

备注

点击查看摘要

Abstract:As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost-efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5x lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget.

96. 【2602.02821】When Efficient Communication Explains Convexity

链接https://arxiv.org/abs/2602.02821

作者:Ashvin Ranjan,Shane Steinert-Threlkeld

类目:Computation and Language (cs.CL); Information Theory (cs.IT)

关键词:optimally balancing competing, balancing competing pressures, efficient communication, recent work, work has argued

备注

点击查看摘要

Abstract:Much recent work has argued that the variation in the languages of the world can be explained from the perspective of efficient communication; in particular, languages can be seen as optimally balancing competing pressures to be simple and to be informative. Focusing on the expression of meaning -- semantic typology -- the present paper asks what factors are responsible for successful explanations in terms of efficient communication. Using the Information Bottleneck (IB) approach to formalizing this trade-off, we first demonstrate and analyze a correlation between optimality in the IB sense and a novel generalization of convexity to this setting. In a second experiment, we manipulate various modeling parameters in the IB framework to determine which factors drive the correlation between convexity and optimality. We find that the convexity of the communicative need distribution plays an especially important role. These results move beyond showing that efficient communication can explain aspects of semantic typology into explanations for why that is the case by identifying which underlying factors are responsible.

97. 【2602.02774】AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic

链接https://arxiv.org/abs/2602.02774

作者:Israel Abebe Azime,Abenezer Kebede Angamo,Hana Mekonen Tamiru,Dagnachew Mekonnen Marilign,Philipp Slusallek,Seid Muhie Yimam,Dietrich Klakow

类目:Computation and Language (cs.CL)

关键词:large language models, treated as synonymous, growing emphasis, emphasis on multilingual, performance is commonly

备注

点击查看摘要

Abstract:With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a models understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce \textbf{\textit{AmharicStoryQA}}, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.

98. 【2602.02766】Privately Fine-Tuned LLMs Preserve Temporal Dynamics in Tabular Data

链接https://arxiv.org/abs/2602.02766

作者:Lucas Rosenblatt,Peihan Liu,Ryan McKenna,Natalia Ponomareva

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:differentially private synthetic, private synthetic tabular, identically distributed rows, Research on differentially, synthetic tabular data

备注

点击查看摘要

Abstract:Research on differentially private synthetic tabular data has largely focused on independent and identically distributed rows where each record corresponds to a unique individual. This perspective neglects the temporal complexity in longitudinal datasets, such as electronic health records, where a user contributes an entire (sub) table of sequential events. While practitioners might attempt to model such data by flattening user histories into high-dimensional vectors for use with standard marginal-based mechanisms, we demonstrate that this strategy is insufficient. Flattening fails to preserve temporal coherence even when it maintains valid marginal distributions. We introduce PATH, a novel generative framework that treats the full table as the unit of synthesis and leverages the autoregressive capabilities of privately fine-tuned large language models. Extensive evaluations show that PATH effectively captures long-range dependencies that traditional methods miss. Empirically, our method reduces the distributional distance to real trajectories by over 60% and reduces state transition errors by nearly 50% compared to leading marginal mechanisms while achieving similar marginal fidelity.

99. 【2602.02760】From Task Solving to Robust Real-World Adaptation in LLM Agents

链接https://arxiv.org/abs/2602.02760

作者:Pouya Pezeshkpour,Estevam Hruschka

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:increasingly deployed, deployed as specialized, Large language models, call tools, Large language

备注

点击查看摘要

Abstract:Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons. Yet many existing evaluations assume a "clean interface" where dynamics are specified and stable, tools and sensors are reliable, and success is captured by a single explicit objective-often overestimating real-world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit, multi-stakeholder goals. The challenge is therefore not just solving tasks, but adapting while solving: deciding what to trust, what is wanted, when to verify, and when to fall back or escalate. We stress-test deployment-relevant robustness under four operational circumstances: partial observability, dynamic environments, noisy signals, and dynamic agent state. We benchmark agentic LLMs in a grid-based game with a simple goal but long-horizon execution. Episodes violate clean-interface assumptions yet remain solvable, forcing agents to infer rules, pay for information, adapt to environmental and internal shifts, and act cautiously under noise. Across five state-of-the-art LLM agents, we find large gaps between nominal task-solving and deployment-like robustness. Performance generally degrades as grid size and horizon increase, but rankings are unstable: weaker models can beat stronger ones when strategy matches the uncertainty regime. Despite no explicit instruction, agents trade off completion, efficiency, and penalty avoidance, suggesting partial objective inference. Ablations and feature analyses reveal model-specific sensitivities and failure drivers, motivating work on verification, safe action selection, and objective inference under partial observability, noise, and non-stationarity.

100. 【2602.02751】Scaling Small Agents Through Strategy Auctions

链接https://arxiv.org/abs/2602.02751

作者:Lisa Alazraki,William F. Shen,Yoram Bachrach,Akhil Mathur

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Small language models, cost-effective approach, increasingly viewed, proponents claiming, sufficiently capable

备注

点击查看摘要

Abstract:Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 53%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost -- often both -- underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.

101. 【2602.02736】me-Critical Multimodal Medical Transportation: Organs, Patients, and Medical Supplies

链接https://arxiv.org/abs/2602.02736

作者:Elaheh Sabziyan Varnousfaderani,Syed A. M. Shihab,Mohammad Taghizadeh

类目:Computation and Language (cs.CL)

关键词:Unmanned Aerial Vehicles, severely impact outcomes, Unmanned Aerial, Aerial Vehicles, Timely transportation

备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Timely transportation of organs, patients, and medical supplies is critical to modern healthcare, particularly in emergencies and transplant scenarios where even short delays can severely impact outcomes. Traditional ground-based vehicles such as ambulances are often hindered by traffic congestion; while air vehicles such as helicopters are faster but costly. Emerging air vehicles -- Unmanned Aerial Vehicles and electric vertical take-off and landing aircraft -- have lower operating costs, but remain limited by range and susceptibility to weather conditions. A multimodal transportation system that integrates both air and ground vehicles can leverage the strengths of each to enhance overall transportation efficiency. This study introduces a constructive greedy heuristic algorithm for multimodal vehicle dispatching for medical transportation. Four different fleet configurations were tested: (i) ambulances only, (ii) ambulances with Unmanned Aerial Vehicles, (iii) ambulances with electric vertical take-off and landing aircraft, and (iv) a fully integrated fleet of ambulances, Unmanned Aerial Vehicles, and electric vertical take-off and landing aircraft. The algorithm incorporates payload consolidation across compatible routes, accounts for traffic congestion in ground operations and weather conditions in aerial operations, while enabling rapid vehicle dispatching compared to computationally intensive optimization models. Using a common set of conditions, we evaluate all four fleet types to identify the most effective configurations for fulfilling medical transportation needs while minimizing operating costs, recharging/fuel costs, and total transportation time.

102. 【2602.02731】Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors

链接https://arxiv.org/abs/2602.02731

作者:Rohan Pandey,Haijuan Yan,Hong Yu,Jack Tsai

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:public health challenge, critical public health, risk prediction offers, proactive intervention, Veterans Affairs patients

备注

点击查看摘要

Abstract:Homelessness among US veterans remains a critical public health challenge, yet risk prediction offers a pathway for proactive intervention. In this retrospective prognostic study, we analyzed electronic health record (EHR) data from 4,276,403 Veterans Affairs patients during a 2016 observation period to predict first-episode homelessness occurring 3-12 months later in 2017 (prevalence: 0.32-1.19%). We constructed static and time-varying EHR representations, utilizing clinician-informed logic to model the persistence of clinical conditions and social risks over time. We then compared the performance of classical machine learning, transformer-based masked language models, and fine-tuned large language models (LLMs). We demonstrate that incorporating social and behavioral factors into longitudinal models improved precision-recall area under the curve (PR-AUC) by 15-30%. In the top 1% risk tier, models yielded positive predictive values ranging from 3.93-4.72% at 3 months, 7.39-8.30% at 6 months, 9.84-11.41% at 9 months, and 11.65-13.80% at 12 months across model architectures. Large language models underperformed encoder-based models on discrimination but showed smaller performance disparities across racial groups. These results demonstrate that longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable strata, enabling targeted and data-informed prevention strategies for at-risk veterans.

103. 【2602.02726】Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

链接https://arxiv.org/abs/2602.02726

作者:Xuemin Yu,Ankur Garg,Samira Ebrahimi Kahou,Hassan Sajjad

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Deep Learning models, Deep Learning, Learning models encode, encode rich semantic, models encode rich

备注

点击查看摘要

Abstract:Deep Learning models encode rich semantic information in their hidden representations. However, it remains challenging to understand which parts of this information models actually rely on when making predictions. A promising line of post-hoc concept-based explanation methods relies on clustering token representations. However, commonly used approaches such as hierarchical clustering are computationally infeasible for large-scale datasets, and K-Means often yields shallow or frequency-dominated clusters. We propose the vector quantized latent concept (VQLC) method, a framework built upon the vector quantized-variational autoencoder (VQ-VAE) architecture that learns a discrete codebook mapping continuous representations to concept vectors. We perform thorough evaluations and show that VQLC improves scalability while maintaining comparable quality of human-understandable explanations.

104. 【2602.02712】owards Understanding Steering Strength

链接https://arxiv.org/abs/2602.02712

作者:Magamed Taimeskhanov,Samuel Vaiter,Damien Garreau

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:intermediate latent representations, popular approach, approach to post-training, post-training control, control of large

备注: 33 pages (including appendix)

点击查看摘要

Abstract:A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations. Namely, identify a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, whereas its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.

105. 【2602.02708】BinaryPPO: Efficient Policy Optimization for Binary Classification

链接https://arxiv.org/abs/2602.02708

作者:Punya Syon Pandey,Zhijing Jin

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:factuality verification, toxicity detection, causal inference, standard approach, binary classification tasks

备注

点击查看摘要

Abstract:Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain-specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in-depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence-based reward design provides a robust alternative to SFT for binary classification. Our code is available at this https URL.

106. 【2602.02704】InfMem: Learning System-2 Memory Control for Long-Context Agent

链接https://arxiv.org/abs/2602.02704

作者:Xinyu Wang,Mingze Li,Peng Lu,Xiao-Wen Chang,Lifeng Shang,Jinping Li,Fei Mi,Prasanna Parthasarathi,Yufei Cui

类目:Computation and Language (cs.CL)

关键词:documents requires synthesizing, requires synthesizing sparse, strict memory constraints, synthesizing sparse evidence, sparse evidence scattered

备注

点击查看摘要

Abstract:Reasoning over ultra-long documents requires synthesizing sparse evidence scattered across distant segments under strict memory constraints. While streaming agents enable scalable processing, their passive memory update strategy often fails to preserve low-salience bridging evidence required for multi-hop reasoning. We propose InfMem, a control-centric agent that instantiates System-2-style control via a PreThink-Retrieve-Write protocol. InfMem actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update a bounded memory. To ensure reliable control, we introduce a practical SFT-to-RL training recipe that aligns retrieval, writing, and stopping decisions with end-task correctness. On ultra-long QA benchmarks from 32k to 1M tokens, InfMem consistently outperforms MemAgent across backbones. Specifically, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B, respectively, while reducing inference time by $3.9\times$ on average (up to $5.1\times$) via adaptive early stopping.

107. 【2602.02686】Monotonicity as an Architectural Bias for Robust Language Models

链接https://arxiv.org/abs/2602.02686

作者:Patrick Cooper,Alireza Nadali,Ashutosh Trivedi,Alvaro Velasquez

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:language models, Large language models, exhibit brittle behavior, neural language models, Transformer-based language models

备注: 12 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.

Comments:
12 pages, 1 figure

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

MSC classes:
68T50, 68T05

ACMclasses:
I.2.7; I.2.6

Cite as:
arXiv:2602.02686 [cs.CL]

(or
arXiv:2602.02686v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.02686

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
108. 【2602.02636】WideSeek: Advancing Wide Research via Multi-Agent Scaling

链接https://arxiv.org/abs/2602.02636

作者:Ziyang Huang,Haolin Ren,Xiaowei Yuan,Jiawei Wang,Zhongtao Jiang,Kun Xu,Shizhu He,Jun Zhao,Kang Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:synthesizing comprehensive information, Wide Research, Wide Research paradigm, intelligence is evolving, essential for retrieving

备注

点击查看摘要

Abstract:Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

109. 【2602.02635】Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management

链接https://arxiv.org/abs/2602.02635

作者:Siyu Li,Chenwei Song,Qi Zhou,Wan Zhou,Xinyi Liu

类目:Computation and Language (cs.CL)

关键词:large language models, integrates structured domain, structured domain knowledge, language models, paper proposes

备注

点击查看摘要

Abstract:This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.

110. 【2602.02605】Fine-Tuning Language Models to Know What They Know

链接https://arxiv.org/abs/2602.02605

作者:Sangjun Park,Elliot Meyerson,Xin Qiu,Risto Miikkulainen

类目:Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)

关键词:component of intelligence, critical component, Metacognition, Evolution Strategy, knowledge

备注: Preprint

点击查看摘要

Abstract:Metacognition is a critical component of intelligence, specifically regarding the awareness of one's own knowledge. While humans rely on shared internal memory for both answering questions and reporting their knowledge state, this dependency in LLMs remains underexplored. This study proposes a framework to measure metacognitive ability $d_{\rm{type2}}'$ using a dual-prompt method, followed by the introduction of Evolution Strategy for Metacognitive Alignment (ESMA) to bind a model's internal knowledge to its explicit behaviors. ESMA demonstrates robust generalization across diverse untrained settings, indicating a enhancement in the model's ability to reference its own knowledge. Furthermore, parameter analysis attributes these improvements to a sparse set of significant modifications.

111. 【2602.02582】Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems

链接https://arxiv.org/abs/2602.02582

作者:Chandan Kumar Sah,Xiaoli Lian,Li Zhang,Tony Xu,Syed Shazaib Shah

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:Large language models, enable powerful zero-shot, broad contextual knowledge, leveraging broad contextual, embedded biases threaten

备注: Accepted at the Second Conference of the International Association for Safe and Ethical Artificial Intelligence, IASEAI26, 14 pages

点击查看摘要

Abstract:Large language models (LLMs) enable powerful zero-shot recommendations by leveraging broad contextual knowledge, yet predictive uncertainty and embedded biases threaten reliability and fairness. This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of LLM-generated recommendations. We introduce a benchmark of curated metrics and a dataset annotated for eight demographic attributes (31 categorical values) across two domains: movies and music. Through in-depth case studies, we quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind's Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes; measured similarity-based gaps are SNSR at 0.1363 and SNSV at 0.0507. These disparities persist under prompt perturbations such as typographical errors and multilingual inputs. We further integrate personality-aware fairness into the RecLLM evaluation pipeline to reveal personality-linked bias patterns and expose trade-offs between personalization and group fairness. We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark that advances explainability and equity in LLM recommendations. Together, these contributions establish a foundation for safer, more interpretable RecLLMs and motivate future work on multi-model benchmarks and adaptive calibration for trustworthy deployment.

112. 【2602.02538】Enhancing Post-Training Quantization via Future Activation Awareness

链接https://arxiv.org/abs/2602.02538

作者:Zheqi Lv,Zhenxuan Fan,Qi Tian,Wenqiao Zhang,Yueting Zhuang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, compress large language, Post-training quantization, language models, compress large

备注

点击查看摘要

Abstract:Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.

113. 【2602.02536】From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation

链接https://arxiv.org/abs/2602.02536

作者:Tianle Gu,Kexin Huang,Lingyu Li,Ruilin Luo,Shiyang Huang,Zongqi Wang,Yujiu Yang,Yan Teng,Yingchun Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:identifying harmful content, harmful content, pivotal for identifying, identifying harmful, Safety

备注

点击查看摘要

Abstract:Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi-head scalar reward model (UniRM). UniRM provides multi-dimensional supervision by assigning attribute-level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40\% of the training data used by leading baselines. Ablations further validate our multi-attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \href{this https URL}{project website}.

114. 【2602.02526】he "Robert Boulton" Singularity: Semantic Tunneling and Manifold Unfolding in Recursive AI

链接https://arxiv.org/abs/2602.02526

作者:Pengyue Hou

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Physics (physics.comp-ph)

关键词:generative artificial intelligence, artificial intelligence trained, monitored via Perplexity, recursive synthetic data, stability of generative

备注: Companion paper to [arXiv:2601.11594](https://arxiv.org/abs/2601.11594) . Provides empirical validation of the MNCIS framework in Large Language Models (GPT-2) using a recursive training protocol (N=1500). Includes complete, reproducible Python implementation of Adaptive Spectral Negative Coupling (ASNC) and Effective Rank metrics in the Appendix

点击查看摘要

Abstract:The stability of generative artificial intelligence trained on recursive synthetic data is conventionally monitored via Perplexity (PPL). We demonstrate that PPL is a deceptive metric in context-stabilized regimes (L=128). Using a rigorous sliding-window protocol (N=1500), we identify a novel failure mode termed "Semantic Tunneling." While the Baseline model maintains high grammatical fluency (PPL approx. 83.9), it suffers a catastrophic loss of semantic diversity, converging within seven generations to a single, low-entropy narrative attractor: the "Robert Boulton" Singularity. This phenomenon represents a total collapse of the latent manifold (Global Effective Rank 3.62 - 2.22), where the model discards diverse world knowledge to optimize for statistically safe syntactic templates. To address this, we apply the Multi-Scale Negative Coupled Information Systems (MNCIS) framework recently established in Hou (2026) [arXiv:2601.11594]. We demonstrate that Adaptive Spectral Negative Coupling (ASNC) acts as a topological operator that actively induces "Manifold Unfolding." MNCIS forces the model to expand its effective rank from the anisotropic baseline of 3.62 to a hyper-diverse state of 5.35, effectively constructing an "Artificial Manifold" that resists the gravitational pull of semantic attractors and preserves the long-tail distribution of the training data.

115. 【2602.02518】GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning

链接https://arxiv.org/abs/2602.02518

作者:Yuyang Bai,Zhuofeng Li,Ping Nie,Jianwen Xie,Yu Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, real-world knowledge sources, increasingly rely, improve factuality

备注: 15 pages, Project website: [this https URL](https://yuyangbai.com/graphdancer/)

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graph-structured knowledge poses two key challenges: (1) navigating structured, schema-defined relations requires precise function calls rather than similarity-based retrieval, and (2) answering complex questions often demands multi-hop evidence aggregation through iterative information seeking. We propose GraphDancer, a reinforcement learning (RL) framework that teaches LLMs to navigate graphs by interleaving reasoning and function execution. To make RL effective for moderate-sized LLMs, we introduce a graph-aware curriculum that schedules training by the structural complexity of information-seeking trajectories using an easy-to-hard biased sampler. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with either a 14B backbone or GPT-4o-mini, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code and models can be found at this https URL .

116. 【2602.02515】CreditAudit: 2D Auditing for LLM Evaluation and Selection

链接https://arxiv.org/abs/2602.02515

作者:Yiliang Song,Hongjun An,Jiangong Xiao,Haofei Zhao,Jiawei Shao,Xuelong Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Leaderboard scores, rising and converging, marginal differences, frontier language models, steadily rising

备注: First update

点击查看摘要

Abstract:Leaderboard scores on public benchmarks have been steadily rising and converging, with many frontier language models now separated by only marginal differences. However, these scores often fail to match users' day to day experience, because system prompts, output protocols, and interaction modes evolve under routine iteration, and in agentic multi step pipelines small protocol shifts can trigger disproportionate failures, leaving practitioners uncertain about which model to deploy. We propose CreditAudit, a deployment oriented credit audit framework that evaluates models under a family of semantically aligned and non adversarial system prompt templates across multiple benchmarks, reporting mean ability as average performance across scenarios and scenario induced fluctuation sigma as a stability risk signal, and further mapping volatility into interpretable credit grades from AAA to BBB via cross model quantiles with diagnostics that mitigate template difficulty drift. Controlled experiments on GPQA, TruthfulQA, and MMLU Pro show that models with similar mean ability can exhibit substantially different fluctuation, and stability risk can overturn prioritization decisions in agentic or high failure cost regimes. By providing a 2D and grade based language for regime specific selection, CreditAudit supports tiered deployment and more disciplined allocation of testing and monitoring effort, enabling more objective and trustworthy model evaluation for real world use.

117. 【2602.02510】Beyond Translation: Cross-Cultural Meme Transcreation with Vision-Language Models

链接https://arxiv.org/abs/2602.02510

作者:Yuming Zhao,Peiyi Zhang,Oana Ignat

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:specificity poses significant, poses significant challenges, cultural specificity poses, online communication, cross-cultural meme transcreation

备注

点击查看摘要

Abstract:Memes are a pervasive form of online communication, yet their cultural specificity poses significant challenges for cross-cultural adaptation. We study cross-cultural meme transcreation, a multimodal generation task that aims to preserve communicative intent and humor while adapting culture-specific references. We propose a hybrid transcreation framework based on vision-language models and introduce a large-scale bidirectional dataset of Chinese and US memes. Using both human judgments and automated evaluation, we analyze 6,315 meme pairs and assess transcreation quality across cultural directions. Our results show that current vision-language models can perform cross-cultural meme transcreation to a limited extent, but exhibit clear directional asymmetries: US-Chinese transcreation consistently achieves higher quality than Chinese-US. We further identify which aspects of humor and visual-textual design transfer across cultures and which remain challenging, and propose an evaluation framework for assessing cross-cultural multimodal generation. Our code and dataset are publicly available at this https URL.

118. 【2602.02499】ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

链接https://arxiv.org/abs/2602.02499

作者:Yunao Zheng,Xiaojie Wang,Lei Ren,Wei Chen

类目:Computation and Language (cs.CL)

关键词:central challenges facing, challenges facing today, facing today large, today large language, large language models

备注

点击查看摘要

Abstract:Long-context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from a limited coverage of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Beyond the standard attention mechanism, ROSA-Tuning leverages in parallel a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module, which efficiently locates historical positions in long contexts that are relevant to the current query, and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion can then be handled by range-restricted attention. To enable end-to-end training, we employ the binary discretization strategy and the counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to and in some cases matching global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage that are nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. The example code can be found at this https URL.

119. 【2602.02498】st-Time Detoxification without Training or Learning Anything

链接https://arxiv.org/abs/2602.02498

作者:Baturay Saglam,Dionysis Kalogerias

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, Large language, creating risks, deployed at scale, risks when deployed

备注

点击查看摘要

Abstract:Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.

120. 【2602.02497】STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

链接https://arxiv.org/abs/2602.02497

作者:Xuzhao Li,Xuchen Li,Jian Zhao,Shiyu Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, achieve significant breakthroughs, measuring machine intelligence, complex reasoning tasks

备注: Preprint, Under review

点击查看摘要

Abstract:As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.

121. 【2602.02496】he Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

链接https://arxiv.org/abs/2602.02496

作者:Shikhar Shiromani,Archie Chaudhury,Sri Pranav Kunda

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, frequently exhibit unfaithful, exhibit unfaithful behavior, frequently exhibit

备注: 8 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain of thought (CoT) reasoning in order to appease the user they are conversing with. In order to better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric utilizing Sparse Autoencoders (SAEs) to quantify the divergence between a model's internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model's tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic's Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally "knows" the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41-0.50 AUROC).

122. 【2602.02488】RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

链接https://arxiv.org/abs/2602.02488

作者:Yinjie Wang,Tianbao Xie,Ke Shen,Mengdi Wang,Ling Yang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:reinforcement learning framework, dynamically forges environment, amplifying learning signals, closed-loop optimization, framework that dynamically

备注: Code: [this https URL](https://github.com/Gen-Verse/Open-AgentRL)

点击查看摘要

Abstract:We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: this https URL

123. 【2602.02276】Kimi K2.5: Visual Agentic Intelligence

链接https://arxiv.org/abs/2602.02276

作者:Kimi Team:Tongtong Bai,Yifan Bai,Yiping Bao,S.H. Cai,Yuan Cao,Y. Charles,H.S. Che,Cheng Chen,Guanduo Chen,Huarong Chen,Jia Chen,Jiahao Chen,Jianlong Chen,Jun Chen,Kefan Chen,Liang Chen,Ruijue Chen,Xinhao Chen,Yanru Chen,Yanxu Chen,Yicun Chen,Yimin Chen,Yingjiang Chen,Yuankun Chen,Yujie Chen,Yutian Chen,Zhirong Chen,Ziwei Chen,Dazhi Cheng,Minghan Chu,Jialei Cui,Jiaqi Deng,Muxi Diao,Hao Ding,Mengfan Dong,Mengnan Dong,Yuxin Dong,Yuhao Dong,Angang Du,Chenzhuang Du,Dikang Du,Lingxiao Du,Yulun Du,Yu Fan,Shengjun Fang,Qiulin Feng,Yichen Feng,Garimugai Fu,Kelin Fu,Hongcheng Gao,Tong Gao,Yuyao Ge,Shangyi Geng,Chengyang Gong,Xiaochen Gong,Zhuoma Gongque,Qizheng Gu,Xinran Gu,Yicheng Gu,Longyu Guan,Yuanying Guo,Xiaoru Hao,Weiran He,Wenyang He,Yunjia He,Chao Hong,Hao Hu,Jiaxi Hu,Yangyang Hu,Zhenxing Hu,Ke Huang,Ruiyuan Huang,Weixiao Huang,Zhiqi Huang,Tao Jiang,Zhejun Jiang,Xinyi Jin,Yu Jing,Guokun Lai,Aidi Li,C. Li,Cheng Li,Fang Li,Guanghe Li,Guanyu Li,Haitao Li,Haoyang Li,Jia Li,Jingwei Li,Junxiong Li,Lincan Li,Mo Li,Weihong Li,Wentao Li,Xinhang Li,Xinhao Li,Yang Li,Yanhao Li,Yiwei Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:advance general agentic, designed to advance, advance general, general agentic intelligence, Kimi

备注: Kimi K2.5 tech report

点击查看摘要

Abstract:We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

124. 【2602.01995】hinking Like a Doctor: Conversational Diagnosis through the Exploration of Diagnostic Knowledge Graphs

链接https://arxiv.org/abs/2602.01995

作者:Jeongmoon Won,Seungwon Kook,Yohan Jo

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:refine differential diagnoses, requires multi-turn history-taking, multi-turn history-taking, diagnosis requires multi-turn, refine differential

备注

点击查看摘要

Abstract:Conversational diagnosis requires multi-turn history-taking, where an agent asks clarifying questions to refine differential diagnoses under incomplete information. Existing approaches often rely on the parametric knowledge of a model or assume that patients provide rich and concrete information, which is unrealistic. To address these limitations, we propose a conversational diagnosis system that explores a diagnostic knowledge graph to reason in two steps: (i) generating diagnostic hypotheses from the dialogue context, and (ii) verifying hypotheses through clarifying questions, which are repeated until a final diagnosis is reached. Since evaluating the system requires a realistic patient simulator that responds to the system's questions, we adopt a well-established simulator along with patient profiles from MIMIC-IV. We further adapt it to describe symptoms vaguely to reflect real-world patients during early clinical encounters. Experiments show improved diagnostic accuracy and efficiency over strong baselines, and evaluations by physicians support the realism of our simulator and the clinical utility of the generated questions. Our code will be released upon publication.

125. 【2602.01538】Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

链接https://arxiv.org/abs/2602.01538

作者:Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:talking avatars, grounded human-object, Aware Generation Module, perception, grounded human-object interaction

备注

点击查看摘要

Abstract:Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: this https URL

126. 【2602.03245】Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect

链接https://arxiv.org/abs/2602.03245

作者:Nikola Ljubešić,Peter Rupnik,Tea Perinčić

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

关键词:documents our efforts, efforts in releasing, written and spoken, audio book, audio components

备注: 2 figures, 14 pages, accepted and presented at JTDH 2024

点击查看摘要

Abstract:This paper documents our efforts in releasing the printed and audio book of the translation of the famous novel The Little Prince into the Chakavian dialect, as a computer-readable, AI-ready dataset, with the textual and the audio components of the two releases now aligned on the level of each written and spoken word. Our motivation for working on this release is multiple. The first one is our wish to preserve the highly valuable and specific content beyond the small editions of the printed and the audio book. With the dataset published in the this http URL repository, this content is from now on at the fingertips of any interested individual. The second motivation is to make the data available for various artificial-intelligence-related usage scenarios, such as the one we follow upon inside this paper already -- adapting the Whisper-large-v3 open automatic speech recognition model, with decent performance on standard Croatian, to Chakavian dialectal speech. We can happily report that with adapting the model, the word error rate on the selected test data has being reduced to a half, while we managed to remove up to two thirds of the error on character level. We envision many more usages of this dataset beyond the set of experiments we have already performed, both on tasks of artificial intelligence research and application, as well as dialectal research. The third motivation for this release is our hope that this, now highly structured dataset, will be transformed into a digital online edition of this work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.

127. 【2602.02980】WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

链接https://arxiv.org/abs/2602.02980

作者:Xi Xuan,Davide Carbone,Ruchi Pandey,Wenxin Zhang,Tomi H. Kinnunen

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Signal Processing (eess.SP)

关键词:detectors primarily focuses, deepfake detectors primarily, Designing front-ends, detectors primarily, primarily focuses

备注: Submitted to IEEE Signal Processing Letters

点击查看摘要

Abstract:Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high-frequency and directional resolutions ($Q, L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.

128. 【2602.02940】A vector logic for intensional formal semantics

链接https://arxiv.org/abs/2602.02940

作者:Daniel Quigley

类目:Logic (math.LO); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)

关键词:Formal semantics, model-theoretic structures, shaped by usage, high-dimensional spaces shaped, linguistic meaning

备注: 25 pages; 68 sources

点击查看摘要

Abstract:Formal semantics and distributional semantics are distinct approaches to linguistic meaning: the former models meaning as reference via model-theoretic structures; the latter represents meaning as vectors in high-dimensional spaces shaped by usage. This paper proves that these frameworks are structurally compatible for intensional semantics. We establish that Kripke-style intensional models embed injectively into vector spaces, with semantic functions lifting to (multi)linear maps that preserve composition. The construction accommodates multiple index sorts (worlds, times, locations) via a compound index space, representing intensions as linear operators. Modal operators are derived algebraically: accessibility relations become linear operators, and modal conditions reduce to threshold checks on accumulated values. For uncountable index domains, we develop a measure-theoretic generalization in which necessity becomes truth almost everywhere and possibility becomes truth on a set of positive measure, a non-classical logic natural for continuous parameters.

129. 【2602.02734】WAXAL: A Large-Scale Multilingual African Language Speech Corpus

链接https://arxiv.org/abs/2602.02734

作者:Abdoulaye Diack,Perry Nelson,Kwaku Agbesi,Angela Nakalembe,MohamedElfatih MohamedKhair,Vusumuzi Dube,Tavonga Siyavora,Subhashini Venugopalan,Jason Hickey,Uche Okonkwo,Abhishek Bapna,Isaac Wiafe,Raynard Dodzi Helegah,Elikem Doe Atsakpo,Charles Nutrokpor,Fiifi Baffoe Payin Winful,Kafui Kwashie Solaga,Jamal-Deen Abdulai,Akon Obu Ekpezu,Audace Niyonkuru,Samuel Rutunda,Boris Ishimwe,Michael Melese,Engineer Bainomugisha,Joyce Nakatumba-Nabende,Andrew Katumba,Claire Babirye,Jonathan Mukiibi,Vincent Kimani,Samuel Kibacia,James Maina,Fridah Emmah,Ahmed Ibrahim Shekarau,Ibrahim Shehu Adamu,Yusuf Abdullahi,Howard Lakougna,Bob MacDonald,Hadar Shemtov,Aisha Walcott-Bryant,Moustapha Cisse,Avinatan Hassidim,Jeff Dean,Yossi Matias

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:predominantly favored high-resource, favored high-resource languages, Sub-Saharan African languages, significant digital divide, Automated Speech Recognition

备注: Initial dataset release

点击查看摘要

Abstract:The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at this https URL under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.

130. 【2602.02598】Social Catalysts, Not Moral Agents: The Illusion of Alignment in LLM Societies

链接https://arxiv.org/abs/2602.02598

作者:Yueqing Hu,Yixuan Jiang,Zehua Jiang,Xiao Wen,Tianhong Wang

类目:Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

关键词:Large Language Models, Large Language, evolution of Large, Multi-Agent Systems, Systems where collective

备注: 7 pages, 5 figures

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems where collective cooperation is often threatened by the "Tragedy of the Commons." This study investigates the effectiveness of Anchoring Agents--pre-programmed altruistic entities--in fostering cooperation within a Public Goods Game (PGG). Using a full factorial design across three state-of-the-art LLMs, we analyzed both behavioral outcomes and internal reasoning chains. While Anchoring Agents successfully boosted local cooperation rates, cognitive decomposition and transfer tests revealed that this effect was driven by strategic compliance and cognitive offloading rather than genuine norm internalization. Notably, most agents reverted to self-interest in new environments, and advanced models like GPT-4.1 exhibited a "Chameleon Effect," masking strategic defection under public scrutiny. These findings highlight a critical gap between behavioral modification and authentic value alignment in artificial societies.

信息检索

1. 【2602.03713】Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals

链接https://arxiv.org/abs/2602.03713

作者:Moritz Vandenhirtz,Kaveh Hassani,Shervin Ghasemlou,Shuai Shao,Hamid Eghbalzadeh,Fuchun Peng,Jun Liu,Michael Louis Iuzzolino

类目:Information Retrieval (cs.IR)

关键词:user interaction history, resulting user representation, systems rank relevant, stored item embeddings, rank relevant items

备注

点击查看摘要

Abstract:Sequential recommender systems rank relevant items by modeling a user's interaction history and computing the inner product between the resulting user representation and stored item embeddings. To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes. Here, the next item is predicted by an autoregressive model that generates the code sequence corresponding to the predicted item. However, despite promising ranking capabilities on small datasets, these methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address. To resolve this, we propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities and introduces a novel self-supervised quantization learning approach for images based on the DINO framework. Additionally, MSCGRec fuses collaborative and semantic signals by extracting collaborative features from sequential recommenders and treating them as a separate modality. Finally, we propose constrained sequence learning that restricts the large output space during training to the set of permissible tokens. We empirically demonstrate on three large real-world datasets that MSCGRec outperforms both sequential and generative recommendation baselines and provide an extensive ablation study to validate the impact of each component.

2. 【2602.03692】Bringing Reasoning to Generative Recommendation Through the Lens of Cascaded Ranking

链接https://arxiv.org/abs/2602.03692

作者:Xinyu Lin,Pengyuan Liu,Wenjie Wang,Yicheng Hu,Chen Xu,Fuli Feng,Qifan Wang,Tat-Seng Chua

类目:Information Retrieval (cs.IR)

关键词:high FLOPS utilization, high FLOPS, FLOPS utilization, Generative Recommendation, approach with high

备注: Accepted by WWW2026

点击查看摘要

Abstract:Generative Recommendation (GR) has become a promising end-to-end approach with high FLOPS utilization for resource-efficient recommendation. Despite the effectiveness, we show that current GR models suffer from a critical \textbf{bias amplification} issue, where token-level bias escalates as token generation progresses, ultimately limiting the recommendation diversity and hurting the user experience. By comparing against the key factor behind the success of traditional multi-stage pipelines, we reveal two limitations in GR that can amplify the bias: homogeneous reliance on the encoded history, and fixed computational budgets that prevent deeper user preference understanding. To combat the bias amplification issue, it is crucial for GR to 1) incorporate more heterogeneous information, and 2) allocate greater computational resources at each token generation step. To this end, we propose CARE, a simple yet effective cascaded reasoning framework for debiased GR. To incorporate heterogeneous information, we introduce a progressive history encoding mechanism, which progressively incorporates increasingly fine-grained history information as the generation process advances. To allocate more computations, we propose a query-anchored reasoning mechanism, which seeks to perform a deeper understanding of historical information through parallel reasoning steps. We instantiate CARE on three GR backbones. Empirical results on four datasets show the superiority of CARE in recommendation accuracy, diversity, efficiency, and promising scalability. The codes and datasets are available at this https URL.

Comments:
Accepted by WWW2026

Subjects:

Information Retrieval (cs.IR)

Cite as:
arXiv:2602.03692 [cs.IR]

(or
arXiv:2602.03692v2 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.03692

Focus to learn more

              arXiv-issued DOI via DataCite</p>
3. 【2602.03652】RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

链接https://arxiv.org/abs/2602.03652

作者:Süha Kağan Köse,Mehmet Can Baytekin,Burak Aktaş,Bilge Kaan Görür,Evren Ayberk Munis,Deniz Yılmaz,Muhammed Yusuf Kartal,Çağrı Toraman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:enhances LLM factuality, guidance remains English-centric, design guidance remains, morphologically rich languages, Retrieval-Augmented Generation

备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%) that is considerably higher than the baseline (78.70%). Also a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) with much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.

4. 【2602.03640】utorial on Reasoning for IR IR for Reasoning

链接https://arxiv.org/abs/2602.03640

作者:Mohanna Hoveyda,Panagiotis Efstratiadis,Arjen de Vries,Maarten de Rijke

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:semantic relatedness, long focused, focused on ranking, ranking documents, documents by semantic

备注: Accepted to ECIR 2026

点击查看摘要

Abstract:Information retrieval has long focused on ranking documents by semantic relatedness. Yet many real-world information needs demand more: enforcement of logical constraints, multi-step inference, and synthesis of multiple pieces of evidence. Addressing these requirements is, at its core, a problem of reasoning. Across AI communities, researchers are developing diverse solutions for the problem of reasoning, from inference-time strategies and post-training of LLMs, to neuro-symbolic systems, Bayesian and probabilistic frameworks, geometric representations, and energy-based models. These efforts target the same problem: to move beyond pattern-matching systems toward structured, verifiable inference. However, they remain scattered across disciplines, making it difficult for IR researchers to identify the most relevant ideas and opportunities. To help navigate the fragmented landscape of research in reasoning, this tutorial first articulates a working definition of reasoning within the context of information retrieval and derives from it a unified analytical framework. The framework maps existing approaches along axes that reflect the core components of the definition. By providing a comprehensive overview of recent approaches and mapping current methods onto the defined axes, we expose their trade-offs and complementarities, highlight where IR can benefit from cross-disciplinary advances, and illustrate how retrieval process itself can play a central role in broader reasoning systems. The tutorial will equip participants with both a conceptual framework and practical guidance for enhancing reasoning-capable IR systems, while situating IR as a domain that both benefits and contributes to the broader development of reasoning methodologies.

5. 【2602.03608】Controlling Output Rankings in Generative Engines for LLM-based Search

链接https://arxiv.org/abs/2602.03608

作者:Haibo Jin,Ruoxi Chen,Peiyan Zhang,Yifeng Luo,Huimin Zeng,Man Luo,Haohan Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:large language models, textbf, rise of large, search, CORE

备注: 23 pages

点击查看摘要

Abstract:The way customers search for and choose products is changing with the rise of large language models (LLMs). LLM-based search, or generative engines, provides direct product recommendations to users, rather than traditional online search results that require users to explore options themselves. However, these recommendations are strongly influenced by the initial retrieval order of LLMs, which disadvantages small businesses and independent creators by limiting their visibility. In this work, we propose CORE, an optimization method that \textbf{C}ontrols \textbf{O}utput \textbf{R}ankings in g\textbf{E}nerative Engines for LLM-based search. Since the LLM's interactions with the search engine are black-box, CORE targets the content returned by search engines as the primary means of influencing output rankings. Specifically, CORE optimizes retrieved content by appending strategically designed optimization content to steer the ranking of outputs. We introduce three types of optimization content: string-based, reasoning-based, and review-based, demonstrating their effectiveness in shaping output rankings. To evaluate CORE in realistic settings, we introduce ProductBench, a large-scale benchmark with 15 product categories and 200 products per category, where each product is associated with its top-10 recommendations collected from Amazon's search interface. Extensive experiments on four LLMs with search capabilities (GPT-4o, Gemini-2.5, Claude-4, and Grok-3) demonstrate that CORE achieves an average Promotion Success Rate of \textbf{91.4\% @Top-5}, \textbf{86.6\% @Top-3}, and \textbf{80.3\% @Top-1}, across 15 product categories, outperforming existing ranking manipulation methods while preserving the fluency of optimized content.

Comments:
23 pages

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.03608 [cs.CL]

(or
arXiv:2602.03608v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.03608

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
6. 【2602.03439】Ontology-to-tools compilation for executable semantic constraint enforcement in LLM agents

链接https://arxiv.org/abs/2602.03439

作者:Xiaochi Zhou,Patrick Bulter,Changxuan Yang,Simon D. Rihm,Thitikarn Angkanaporn,Jethro Akroyd,Sebastian Mosbach,Markus Kraft

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:coupling large language, large language models, Model Context Protocol, mechanism for coupling, coupling large

备注

点击查看摘要

Abstract:We introduce ontology-to-tools compilation as a proof-of-principle mechanism for coupling large language models (LLMs) with formal domain knowledge. Within The World Avatar (TWA), ontological specifications are compiled into executable tool interfaces that LLM-based agents must use to create and modify knowledge graph instances, enforcing semantic constraints during generation rather than through post-hoc validation. Extending TWA's semantic agent composition framework, the Model Context Protocol (MCP) and associated agents are integral components of the knowledge graph ecosystem, enabling structured interaction between generative models, symbolic constraints, and external resources. An agent-based workflow translates ontologies into ontology-aware tools and iteratively applies them to extract, validate, and repair structured knowledge from unstructured scientific text. Using metal-organic polyhedra synthesis literature as an illustrative case, we show how executable ontological semantics can guide LLM behaviour and reduce manual schema and prompt engineering, establishing a general paradigm for embedding formal knowledge into generative systems.

7. 【2602.03432】Failure is Feedback: History-Aware Backtracking for Agentic Traversal in Multimodal Graphs

链接https://arxiv.org/abs/2602.03432

作者:Joohyung Yun,Doyup Lee,Wook-Shin Han

类目:Information Retrieval (cs.IR)

关键词:Open-domain multimodal document, interconnected document corpora, retrieve specific components, Open-domain multimodal, multimodal document retrieval

备注: Project page: [this https URL](https://failureisfeedback.github.io/)

点击查看摘要

Abstract:Open-domain multimodal document retrieval aims to retrieve specific components (paragraphs, tables, or images) from large and interconnected document corpora. Existing graph-based retrieval approaches typically rely on a uniform similarity metric that overlooks hop-specific semantics, and their rigid pre-defined plans hinder dynamic error correction. These limitations suggest that a retriever should adapt its reasoning to the evolving context and recover intelligently from dead ends. To address these needs, we propose Failure is Feedback (FiF), which casts subgraph retrieval as a sequential decision process and introduces two key innovations. (i) We introduce a history-aware backtracking mechanism; unlike standard backtracking that simply reverts the state, our approach piggybacks on the context of failed traversals, leveraging insights from previous failures. (ii) We implement an economically-rational agentic workflow. Unlike conventional agents with static strategies, our orchestrator employs a cost-aware traversal method to dynamically manage the trade-off between retrieval accuracy and inference costs, escalating to intensive LLM-based reasoning only when the prior failure justifies the additional computational investment. Extensive experiments show that FiF achieves state-of-the-art retrieval on the benchmarks of MultimodalQA, MMCoQA and WebQA.

8. 【2602.03422】RankSteer: Activation Steering for Pointwise LLM Ranking

链接https://arxiv.org/abs/2602.03422

作者:Yumeng Wang,Catherine Chen,Suzan Verberne

类目:Information Retrieval (cs.IR)

关键词:Large language models, recently shown strong, shown strong performance, Large language, role-play instructions

备注

点击查看摘要

Abstract:Large language models (LLMs) have recently shown strong performance as zero-shot rankers, yet their effectiveness is highly sensitive to prompt formulation, particularly role-play instructions. Prior analyses suggest that role-related signals are encoded along activation channels that are largely separate from query-document representations, raising the possibility of steering ranking behavior directly at the activation level rather than through brittle prompt engineering. In this work, we propose RankSteer, a post-hoc activation steering framework for zero-shot pointwise LLM ranking. We characterize ranking behavior through three disentangled and steerable directions in representation space: a \textbf{decision direction} that maps hidden states to relevance scores, an \textbf{evidence direction} that captures relevance signals not directly exploited by the decision head, and a \textbf{role direction} that modulates model behavior without injecting relevance information. Using projection-based interventions at inference time, RankSteer jointly controls these directions to calibrate ranking behavior without modifying model weights or introducing explicit cross-document comparisons. Experiments on TREC DL 20 and multiple BEIR benchmarks show that RankSteer consistently improves ranking quality using only a small number of anchor queries, demonstrating that substantial ranking capacity remains under-utilized in pointwise LLM rankers. We further provide a geometric analysis revealing that steering improves ranking by stabilizing ranking geometry and reducing dispersion, offering new insight into how LLMs internally represent and calibrate relevance judgments.

9. 【2602.03416】AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation

链接https://arxiv.org/abs/2602.03416

作者:Wenxin Ye,Lin Li,Ming Li,Yang Shen,Kanghong Wang,Jimmy Xiangji Huang

类目:Information Retrieval (cs.IR)

关键词:Clothing recommendation extends, crucial medium, Clothing recommendation, clothing recommendation models, aesthetic

备注

点击查看摘要

Abstract:Clothing recommendation extends beyond merely generating personalized outfits; it serves as a crucial medium for aesthetic guidance. However, existing methods predominantly rely on user-item-outfit interaction behaviors while overlooking explicit representations of clothing aesthetics. To bridge this gap, we present the AesRec benchmark dataset featuring systematic quantitative aesthetic annotations, thereby enabling the development of aesthetics-aligned recommendation systems. Grounded in professional apparel quality standards and fashion aesthetic principles, we define a multidimensional set of indicators. At the item level, six dimensions are independently assessed: silhouette, chromaticity, materiality, craftsmanship, wearability, and item-level impression. Transitioning to the outfit level, the evaluation retains the first five core attributes while introducing stylistic synergy, visual harmony, and outfit-level impression as distinct metrics to capture the collective aesthetic impact. Given the increasing human-like proficiency of Vision-Language Models in multimodal understanding and interaction, we leverage them for large-scale aesthetic scoring. We conduct rigorous human-machine consistency validation on a fashion dataset, confirming the reliability of the generated ratings. Experimental results based on AesRec further demonstrate that integrating quantified aesthetic information into clothing recommendation models can provide aesthetic guidance for users while fulfilling their personalized requirements.

10. 【2602.03345】Beyond Exposure: Optimizing Ranking Fairness with Non-linear Time-Income Functions

链接https://arxiv.org/abs/2602.03345

作者:Xuancheng Li,Tao Yang,Yujia Zhou,Qingyao Ai,Yiqun Liu

类目:Information Retrieval (cs.IR)

关键词:search and recommendation, fairness, central to information, information distribution, distribution in web

备注

点击查看摘要

Abstract:Ranking is central to information distribution in web search and recommendation. Nowadays, in ranking optimization, the fairness to item providers is viewed as a crucial factor alongside ranking relevance for users. There are currently numerous concepts of fairness and one widely recognized fairness concept is Exposure Fairness. However, it relies primarily on exposure determined solely by position, overlooking other factors that significantly influence income, such as time. To address this limitation, we propose to study ranking fairness when the provider utility is influenced by other contextual factors and is neither equal to nor proportional to item exposure. We give a formal definition of Income Fairness and develop a corresponding measurement metric. Simulated experiments show that existing-exposure-fairness-based ranking algorithms fail to optimize the proposed income fairness. Therefore, we propose the Dynamic-Income-Derivative-aware Ranking Fairness algorithm, which, based on the marginal income gain at the present timestep, uses Taylor-expansion-based gradients to simultaneously optimize effectiveness and income fairness. In both offline and online settings with diverse time-income functions, DIDRF consistently outperforms state-of-the-art methods.

11. 【2602.03324】SCASRec: A Self-Correcting and Auto-Stopping Model for Generative Route List Recommendation

链接https://arxiv.org/abs/2602.03324

作者:Chao Chen,Longfei Xu,Daohan Su,Tengfei Liu,Hanyu Guo,Yihai Duan,Kaikui Liu,Xiangxiang Chu

类目:Information Retrieval (cs.IR)

关键词:multi-stage pipeline involving, produce high-quality ordered, high-quality ordered recommendations, pipeline involving fine-ranking, systems commonly adopt

备注

点击查看摘要

Abstract:Route recommendation systems commonly adopt a multi-stage pipeline involving fine-ranking and re-ranking to produce high-quality ordered recommendations. However, this paradigm faces three critical limitations. First, there is a misalignment between offline training objectives and online metrics. Offline gains do not necessarily translate to online improvements. Actual performance must be validated through A/B testing, which may potentially compromise the user experience. Second, redundancy elimination relies on rigid, handcrafted rules that lack adaptability to the high variance in user intent and the unstructured complexity of real-world scenarios. Third, the strict separation between fine-ranking and re-ranking stages leads to sub-optimal performance. Since each module is optimized in isolation, the fine-ranking stage remains oblivious to the list-level objectives (e.g., diversity) targeted by the re-ranker, thereby preventing the system from achieving a jointly optimized global optimum. To overcome these intertwined challenges, we propose SCASRec (Self-Correcting and Auto-Stopping Recommendation), a unified generative framework that integrates ranking and redundancy elimination into a single end-to-end process. SCASRec introduces a stepwise corrective reward (SCR) to guide list-wise refinement by focusing on hard samples, and employs a learnable End-of-Recommendation (EOR) token to terminate generation adaptively when no further improvement is expected. Experiments on two large-scale, open-sourced route recommendation datasets demonstrate that SCASRec establishes an SOTA in offline and online settings. SCASRec has been fully deployed in a real-world navigation app, demonstrating its effectiveness.

12. 【2602.03306】Learning to Select: Query-Aware Adaptive Dimension Selection for Dense Retrieval

链接https://arxiv.org/abs/2602.03306

作者:Zhanyu Wu,Richong Zhang,Zhijie Nie

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:retrieval represents queries, Dense retrieval represents, ful for ranking, ments as high-dimensional, retrieval represents

备注

点击查看摘要

Abstract:Dense retrieval represents queries and docu-002 ments as high-dimensional embeddings, but003 these representations can be redundant at the004 query level: for a given information need, only005 a subset of dimensions is consistently help-006 ful for ranking. Prior work addresses this via007 pseudo-relevance feedback (PRF) based dimen-008 sion importance estimation, which can produce009 query-aware masks without labeled data but010 often relies on noisy pseudo signals and heuris-011 tic test-time procedures. In contrast, super-012 vised adapter methods leverage relevance labels013 to improve embedding quality, yet they learn014 global transformations shared across queries015 and do not explicitly model query-aware di-016 mension importance. We propose a Query-017 Aware Adaptive Dimension Selection frame-018 work that learns to predict per-dimension im-019 portance directly from query embedding. We020 first construct oracle dimension importance dis-021 tributions over embedding dimensions using022 supervised relevance labels, and then train a023 predictor to map a query embedding to these024 label-distilled importance scores. At inference,025 the predictor selects a query-aware subset of026 dimensions for similarity computation based027 solely on the query embedding, without pseudo-028 relevance feedback. Experiments across multi-029 ple dense retrievers and benchmarks show that030 our learned dimension selector improves re-031 trieval effectiveness over the full-dimensional032 baseline as well as PRF-based masking and033 supervised adapter baselines.

13. 【2602.03304】o Search or Not to Search: Aligning the Decision Boundary of Deep Search Agents via Causal Intervention

链接https://arxiv.org/abs/2602.03304

作者:Wenlin Zhang,Kuicai Dong,Junyi Li,Yingyi Zhang,Xiaopeng Li,Pengyue Jia,Yi Wen,Derong Xu,Maolin Wang,Yichao Wang,Yong Liu,Xiangyu Zhao

类目:Information Retrieval (cs.IR)

关键词:multi-turn web-based reasoning, complex information-seeking tasks, web-based reasoning, represent a promising, information-seeking tasks

备注

点击查看摘要

Abstract:Deep search agents, which autonomously iterate through multi-turn web-based reasoning, represent a promising paradigm for complex information-seeking tasks. However, current agents suffer from critical inefficiency: they conduct excessive searches as they cannot accurately judge when to stop searching and start answering. This stems from outcome-centric training that prioritize final results over the search process itself. We identify the root cause as misaligned decision boundaries, the threshold determining when accumulated information suffices to answer. This causes over-search (redundant searching despite sufficient knowledge) and under-search (premature termination yielding incorrect answers). To address these errors, we propose a comprehensive framework comprising two key components. First, we introduce causal intervention-based diagnosis that identifies boundary errors by comparing factual and counterfactual trajectories at each decision point. Second, we develop Decision Boundary Alignment for Deep Search agents (DAS), which constructs preference datasets from causal feedback and aligns policies via preference optimization. Experiments on public datasets demonstrate that decision boundary errors are pervasive across state-of-the-art agents. Our DAS method effectively calibrates these boundaries, mitigating both over-search and under-search to achieve substantial gains in accuracy and efficiency. Our code and data are publicly available at: this https URL.

14. 【2602.03223】Distribution-Aware End-to-End Embedding for Streaming Numerical Features in Click-Through Rate Prediction

链接https://arxiv.org/abs/2602.03223

作者:Jiahao Liu,Hongji Ruan,Weimin Zhang,Ziye Tong,Derick Tang,Zhanpeng Zeng,Qinsong Zeng,Peng Zhang,Tun Lu,Ning Gu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Click-Through Rate prediction, paper explores effective, Click-Through Rate, Rate prediction, explores effective numerical

备注: Under review

点击查看摘要

Abstract:This paper explores effective numerical feature embedding for Click-Through Rate prediction in streaming environments. Conventional static binning methods rely on offline statistics of numerical distributions; however, this inherently two-stage process often triggers semantic drift during bin boundary updates. While neural embedding methods enable end-to-end learning, they often discard explicit distributional information. Integrating such information end-to-end is challenging because streaming features often violate the i.i.d. assumption, precluding unbiased estimation of the population distribution via the expectation of order statistics. Furthermore, the critical context dependency of numerical distributions is often neglected. To this end, we propose DAES, an end-to-end framework designed to tackle numerical feature embedding in streaming training scenarios by integrating distributional information with an adaptive modulation mechanism. Specifically, we introduce an efficient reservoir-sampling-based distribution estimation method and two field-aware distribution modulation strategies to capture streaming distributions and field-dependent semantics. DAES significantly outperforms existing approaches as demonstrated by extensive offline and online experiments and has been fully deployed on a leading short-video platform with hundreds of millions of daily active users.

15. 【2602.03158】PAMAS: Self-Adaptive Multi-Agent System with Perspective Aggregation for Misinformation Detection

链接https://arxiv.org/abs/2602.03158

作者:Zongwei Wang,Min Gao,Junliang Yu,Tong Chen,Chenghua Lin

类目:Information Retrieval (cs.IR)

关键词:social media poses, context-dependent nature complicates, nature complicates detection, social media, media poses

备注: 12 pages

点击查看摘要

Abstract:Misinformation on social media poses a critical threat to information credibility, as its diverse and context-dependent nature complicates detection. Large language model-empowered multi-agent systems (MAS) present a promising paradigm that enables cooperative reasoning and collective intelligence to combat this threat. However, conventional MAS suffer from an information-drowning problem, where abundant truthful content overwhelms sparse and weak deceptive cues. With full input access, agents tend to focus on dominant patterns, and inter-agent communication further amplifies this bias. To tackle this issue, we propose PAMAS, a multi-agent framework with perspective aggregation, which employs hierarchical, perspective-aware aggregation to highlight anomaly cues and alleviate information drowning. PAMAS organizes agents into three roles: Auditors, Coordinators, and a Decision-Maker. Auditors capture anomaly cues from specialized feature subsets; Coordinators aggregate their perspectives to enhance coverage while maintaining diversity; and the Decision-Maker, equipped with evolving memory and full contextual access, synthesizes all subordinate insights to produce the final judgment. Furthermore, to improve efficiency in multi-agent collaboration, PAMAS incorporates self-adaptive mechanisms for dynamic topology optimization and routing-based inference, enhancing both efficiency and scalability. Extensive experiments on multiple benchmark datasets demonstrate that PAMAS achieves superior accuracy and efficiency, offering a scalable and trustworthy way for misinformation detection.

16. 【2602.03059】From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

链接https://arxiv.org/abs/2602.03059

作者:Yoonsang Kim,Divyansh Pradhan,Devshree Jadeja,Arie Kaufman

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)

关键词:converts verbal remote-assistance, framework that converts, spatially grounded, referent disambiguation framework, verbal remote-assistance instructions

备注: 11 pages, 6 figures. This is the author's version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR) 2026

点击查看摘要

Abstract:We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speechto-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

17. 【2602.03056】ALPBench: A Benchmark for Attribution-level Long-term Personal Behavior Understanding

链接https://arxiv.org/abs/2602.03056

作者:Lu Ren,Junda She,Xinchen Luo,Tao Wang,Xin Ye,Xu Zhang,Muxuan Wang,Xiao Yang,Chenguang Wang,Fei Xie,Yiwei Zhou,Danjun Wu,Guodong Zhang,Yifei Hu,Guoying Zheng,Shujie Yang,Xingmei Wang,Shiyao Wang,Yukun Zhou,Fan Yang,Size Li,Kuo Cai,Qiang Luo,Ruiming Tang,Han Li,Kun Gai

类目:Information Retrieval (cs.IR)

关键词:Recent advances, accurately capturing user, personalized recommendation, key challenge, advances in large

备注

点击查看摘要

Abstract:Recent advances in large language models have highlighted their potential for personalized recommendation, where accurately capturing user preferences remains a key challenge. Leveraging their strong reasoning and generalization capabilities, LLMs offer new opportunities for modeling long-term user behavior. To systematically evaluate this, we introduce ALPBench, a Benchmark for Attribution-level Long-term Personal Behavior Understanding. Unlike item-focused benchmarks, ALPBench predicts user-interested attribute combinations, enabling ground-truth evaluation even for newly introduced items. It models preferences from long-term historical behaviors rather than users' explicitly expressed requests, better reflecting enduring interests. User histories are represented as natural language sequences, allowing interpretable, reasoning-based personalization. ALPBench enables fine-grained evaluation of personalization by focusing on the prediction of attribute combinations task that remains highly challenging for current LLMs due to the need to capture complex interactions among multiple attributes and reason over long-term user behavior sequences.

18. 【2602.02883】Efficiency Optimizations for Superblock-based Sparse Retrieval

链接https://arxiv.org/abs/2602.02883

作者:Parker Carlson,Wentai Xie,Rohil Shah,Tao Yang

类目:Information Retrieval (cs.IR)

关键词:efficient CPU-friendly algorithms, Learned sparse retrieval, Learned sparse, CPU-friendly algorithms, popular method

备注: 11 pages, 5 figures, 9 tables. Under review

点击查看摘要

Abstract:Learned sparse retrieval (LSR) is a popular method for first-stage retrieval because it combines the semantic matching of language models with efficient CPU-friendly algorithms. Previous work aggregates blocks into "superblocks" to quickly skip the visitation of blocks during query processing by using an advanced pruning heuristic. This paper proposes a simple and effective superblock pruning scheme that reduces the overhead of superblock score computation while preserving competitive relevance. It combines this scheme with a compact index structure and a robust zero-shot configuration that is effective across LSR models and multiple datasets. This paper provides an analytical justification and evaluation on the MS MARCO and BEIR datasets, demonstrating that the proposed scheme can be a strong alternative for efficient sparse retrieval.

19. 【2602.02827】Col-Bandit: Zero-Shot Query-Time Pruning for Late-Interaction Retrieval

链接https://arxiv.org/abs/2602.02827

作者:Roi Pony,Adi Raz,Oshri Naparstek,Idan Friedman,Udi Barzelay

类目:Information Retrieval (cs.IR)

关键词:exhaustively computing token-level, retrieval quality, computing token-level MaxSim, ColBERT achieve, dominated by exhaustively

备注

点击查看摘要

Abstract:Multi-vector late-interaction retrievers such as ColBERT achieve state-of-the-art retrieval quality, but their query-time cost is dominated by exhaustively computing token-level MaxSim interactions for every candidate document. While approximating late interaction with single-vector representations reduces cost, it often incurs substantial accuracy loss. We introduce Col-Bandit, a query-time pruning algorithm that reduces this computational burden by casting reranking as a finite-population Top-$K$ identification problem. Col-Bandit maintains uncertainty-aware bounds over partially observed document scores and adaptively reveals only the (document, query token) MaxSim entries needed to determine the top results under statistical decision bounds with a tunable relaxation. Unlike coarse-grained approaches that prune entire documents or tokens offline, Col-Bandit sparsifies the interaction matrix on the fly. It operates as a zero-shot, drop-in layer over standard multi-vector systems, requiring no index modifications, offline preprocessing, or model retraining. Experiments on textual (BEIR) and multimodal (REAL-MM-RAG) benchmarks show that Col-Bandit preserves ranking fidelity while reducing MaxSim FLOPs by up to 5$\times$, indicating that dense late-interaction scoring contains substantial redundancy that can be identified and pruned efficiently at query time.

20. 【2602.02636】WideSeek: Advancing Wide Research via Multi-Agent Scaling

链接https://arxiv.org/abs/2602.02636

作者:Ziyang Huang,Haolin Ren,Xiaowei Yuan,Jiawei Wang,Zhongtao Jiang,Kun Xu,Shizhu He,Jun Zhao,Kang Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:synthesizing comprehensive information, Wide Research, Wide Research paradigm, intelligence is evolving, essential for retrieving

备注

点击查看摘要

Abstract:Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

21. 【2602.02582】Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems

链接https://arxiv.org/abs/2602.02582

作者:Chandan Kumar Sah,Xiaoli Lian,Li Zhang,Tony Xu,Syed Shazaib Shah

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:Large language models, enable powerful zero-shot, broad contextual knowledge, leveraging broad contextual, embedded biases threaten

备注: Accepted at the Second Conference of the International Association for Safe and Ethical Artificial Intelligence, IASEAI26, 14 pages

点击查看摘要

Abstract:Large language models (LLMs) enable powerful zero-shot recommendations by leveraging broad contextual knowledge, yet predictive uncertainty and embedded biases threaten reliability and fairness. This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of LLM-generated recommendations. We introduce a benchmark of curated metrics and a dataset annotated for eight demographic attributes (31 categorical values) across two domains: movies and music. Through in-depth case studies, we quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind's Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes; measured similarity-based gaps are SNSR at 0.1363 and SNSV at 0.0507. These disparities persist under prompt perturbations such as typographical errors and multilingual inputs. We further integrate personality-aware fairness into the RecLLM evaluation pipeline to reveal personality-linked bias patterns and expose trade-offs between personalization and group fairness. We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark that advances explainability and equity in LLM recommendations. Together, these contributions establish a foundation for safer, more interpretable RecLLMs and motivate future work on multi-model benchmarks and adaptive calibration for trustworthy deployment.

22. 【2602.02516】Measuring Individual User Fairness with User Similarity and Effectiveness Disparity

链接https://arxiv.org/abs/2602.02516

作者:Theresia Veronika Rampisela,Maria Maistro,Tuukka Ruotsalo,Christina Lioma

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Individual user fairness, user, Individual user, user fairness, user similarity

备注: Preprint of a work that has been accepted to ECIR 2026 Full Papers track as a Findings paper

点击查看摘要

Abstract:Individual user fairness is commonly understood as treating similar users similarly. In Recommender Systems (RSs), several evaluation measures exist for quantifying individual user fairness. These measures evaluate fairness via either: (i) the disparity in RS effectiveness scores regardless of user similarity, or (ii) the disparity in items recommended to similar users regardless of item relevance. Both disparity in recommendation effectiveness and user similarity are very important in fairness, yet no existing individual user fairness measure simultaneously accounts for both. In brief, current user fairness evaluation measures implement a largely incomplete definition of fairness. To fill this gap, we present Pairwise User unFairness (PUF), a novel evaluation measure of individual user fairness that considers both effectiveness disparity and user similarity. PUF is the only measure that can express this important distinction. We empirically validate that PUF does this consistently across 4 datasets and 7 rankers, and robustly when varying user similarity or effectiveness. In contrast, all other measures are either almost insensitive to effectiveness disparity or completely insensitive to user similarity. We contribute the first RS evaluation measure to reliably capture both user similarity and effectiveness in individual user fairness. Our code: this https URL.

23. 【2602.02514】Design and Evaluation of Whole-Page Experience Optimization for E-commerce Search

链接https://arxiv.org/abs/2602.02514

作者:Pratik Lahiri,Bingqing Ge,Zhou Qin,Aditya Jumde,Shuning Huo,Lucas Scottini,Yi Liu,Mahmoud Mamlouk,Wenyang Liu

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:E-commerce Search Results, Search Results Pages, E-commerce Search, Results Pages, Search Results

备注

点击查看摘要

Abstract:E-commerce Search Results Pages (SRPs) are evolving from linear lists to complex, non-linear layouts, rendering traditional position-biased ranking models insufficient. Moreover, existing optimization frameworks typically maximize short-term signals (e.g., clicks, same-day revenue) because long-term satisfaction metrics (e.g., expected two-week revenue) involve delayed feedback and challenging long-horizon credit attribution. To bridge these gaps, we propose a novel Whole-Page Experience Optimization Framework. Unlike traditional list-wise rankers, our approach explicitly models the interplay between item relevance, 2D positional layout, and visual elements. We use a causal framework to develop metrics for measuring long-term user satisfaction based on quasi-experimental data. We validate our approach through industry-scale A/B testing, where the model demonstrated a 1.86% improvement in brand relevance (our primary customer experience metric) while simultaneously achieving a statistically significant revenue uplift of +0.05%

计算机视觉

1. 【2602.03847】EventNeuS: 3D Mesh Reconstruction from a Single Event Camera

链接https://arxiv.org/abs/2602.03847

作者:Shreyas Sachan,Viktor Rudnev,Mohamed Elgharib,Christian Theobalt,Vladislav Golyanik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Event cameras offer, RGB cameras, alternative to RGB, cameras offer, offer a considerable

备注: 13 pages, 10 figures, 3 tables; project page: [this https URL](https://4dqv.mpi-inf.mpg.de/EventNeuS/)

点击查看摘要

Abstract:Event cameras offer a considerable alternative to RGB cameras in many scenarios. While there are recent works on event-based novel-view synthesis, dense 3D mesh reconstruction remains scarcely explored and existing event-based techniques are severely limited in their 3D reconstruction accuracy. To address this limitation, we present EventNeuS, a self-supervised neural model for learning 3D representations from monocular colour event streams. Our approach, for the first time, combines 3D signed distance function and density field learning with event-based supervision. Furthermore, we introduce spherical harmonics encodings into our model for enhanced handling of view-dependent effects. EventNeuS outperforms existing approaches by a significant margin, achieving 34% lower Chamfer distance and 31% lower mean absolute error on average compared to the best previous method.

2. 【2602.03838】PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization

链接https://arxiv.org/abs/2602.03838

作者:Erzhen Hu,Frederik Brudy,David Ledo,George Fitzmaurice,Fraser Anderson

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:rapidly prototype ideas, conventional approaches involve, approaches involve trade-offs, animation experts, fullscale production

备注: 21 pages, 13 figures; accepted and to appear at CHI 2026

点击查看摘要

Abstract:In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film's possibilities before fullscale production, yet conventional approaches involve trade-offs in efficiency and expressiveness. Hand-drawn storyboards often lack spatial precision needed for complex cinematography, while 3D previsualization demands expertise and high-quality rigged assets. To address this gap, we present PrevizWhiz, a system that leverages rough 3D scenes in combination with generative image and video models to create stylized video previews. The workflow integrates frame-level image restyling with adjustable resemblance, time-based editing through motion paths or external video inputs, and refinement into high-fidelity video clips. A study with filmmakers demonstrates that our system lowers technical barriers for film-makers, accelerates creative iteration, and effectively bridges the communication gap, while also surfacing challenges of continuity, authorship, and ethical consideration in AI-assisted filmmaking.

3. 【2602.03828】AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

链接https://arxiv.org/abs/2602.03828

作者:Minjun Zhu,Zhen Lin,Yixuan Weng,Panzhong Lu,Qiujie Xie,Yifan Wei,Sifan Liu,Qiyao Sun,Yue Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

关键词:effectively communicating complex, manual creation remains, communicating complex scientific, High-quality scientific illustrations, scientific

备注: Accepted at the ICLR 2026

点击查看摘要

Abstract:High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset and huggingface space are released in this https URL.

4. 【2602.03826】Continuous Control of Editing Models via Adaptive-Origin Guidance

链接https://arxiv.org/abs/2602.03826

作者:Alon Wolf,Chen Katzir,Kfir Aberman,Or Patashnik

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Diffusion-based editing models, Diffusion-based editing, powerful tool, tool for semantic, Diffusion-based

备注: Project page at [this https URL](https://adaor-paper.github.io/)

点击查看摘要

Abstract:Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) impacts prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control compared to current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedure or reliance on specialized datasets.

5. 【2602.03815】Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

链接https://arxiv.org/abs/2602.03815

作者:Dingkun Zhang,Shuhan Qi,Yulin Wu,Xinyu Xiao,Xuan Wang,Long Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, training inefficiency issue

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency issue, which is associated with their massive model sizes and visual token numbers. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we are exploring another substantial research direction for efficient training by reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's behaviors. The slow-mode is the auxiliary mode, where the model is trained on full visual sequences to retain training-inference consistency. To boost its training, it further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, DualSpeed can achieve both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1$\times$ and LLaVA-NeXT by 4.0$\times$, retaining over 99% performance. Code: this https URL

6. 【2602.03811】Progressive Checkerboards for Autoregressive Multiscale Image Generation

链接https://arxiv.org/abs/2602.03811

作者:David Eigen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:modeling mutual dependencies, efficiently sample independent, sample independent locations, key challenge, independent locations

备注

点击查看摘要

Abstract:A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with like model capacity, using fewer sampling steps.

7. 【2602.03809】SplitSplat: Zero-Shot Panoptic Segmentation via Explicit Instance Modeling and 3D Gaussian Splatting

链接https://arxiv.org/abs/2602.03809

作者:Leonardo Monchieri,Elena Camuffo,Francesco Barbato,Pietro Zanuttigh,Simone Milani

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, semantically aware structure, high-quality scene reconstruction, enables fast, aware structure

备注

点击查看摘要

Abstract:3D Gaussian Splatting (GS) enables fast and high-quality scene reconstruction, but it lacks an object-consistent and semantically aware structure. We propose SplitSplat, a framework for panoptic scene reconstruction using 3DGS. Our approach explicitly models object instances. It first propagates instance masks across views using depth, thus producing view-consistent 2D masks. Each object is then reconstructed independently and merged back into the scene while refining its boundaries. Finally, instance-level semantic descriptors are embedded in the reconstructed objects, supporting various applications, including panoptic segmentation, object retrieval, and 3D editing. Unlike existing methods, SplitSplat tackles the problem by first segmenting the scene and then reconstructing each object individually. This design naturally supports downstream tasks and allows SplitSplat to achieve state-of-the-art performance on the ScanNetv2 segmentation benchmark.

8. 【2602.03798】FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation

链接https://arxiv.org/abs/2602.03798

作者:Zimu Lu,Houxing Ren,Yunqiao Yang,Ke Wang,Zhuofan Zong,Mingjie Zhan,Hongsheng Li

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Assisting non-expert users, develop complex interactive, Assisting non-expert, complex interactive websites, frontend web pages

备注

点击查看摘要

Abstract:Assisting non-expert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach. The code is released at this https URL.

9. 【2602.03796】3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

链接https://arxiv.org/abs/2602.03796

作者:Zhixue Fang,Xu He,Songlin Tang,Haoxian Zhang,Qingfeng Li,Xiaoqiang Liu,Pengfei Wan,Kun Gai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generation typically rely, video generation typically, generation typically, typically rely, motion

备注: Project Page: [this https URL](https://hjrphoebus.github.io/3DiMo/)

点击查看摘要

Abstract:Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.

10. 【2602.03793】BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

链接https://arxiv.org/abs/2602.03793

作者:Yixiang Chen,Peiyan Li,Jiabing Yang,Keji He,Xiangnan Wu,Yuan Xu,Kai Wang,Jing Liu,Nianfeng Liu,Yan Huang,Liang Wang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:leverage large-scale Internet, large-scale Internet videos, Embodied world models, large-scale Internet, Internet videos

备注

点击查看摘要

Abstract:Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at this https URL .

11. 【2602.03785】From Pre- to Intra-operative MRI: Predicting Brain Shift in Temporal Lobe Resection for Epilepsy Surgery

链接https://arxiv.org/abs/2602.03785

作者:Jingjing Peng,Giorgio Fiore,Yang Liu,Ksenia Ellum,Debayan Daspupta,Keyoumars Ashkan,Andrew McEvoy,Anna Miserocchi,Sebastien Ourselin,John Duncan,Alejandro Granados

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:brain magnetic resonance, determining surgical paths, magnetic resonance images, image-guided Neurosurgery Systems, preoperative brain magnetic

备注

点击查看摘要

Abstract:Introduction: In neurosurgery, image-guided Neurosurgery Systems (IGNS) highly rely on preoperative brain magnetic resonance images (MRI) to assist surgeons in locating surgical targets and determining surgical paths. However, brain shift invalidates the preoperative MRI after dural opening. Updated intraoperative brain MRI with brain shift compensation is crucial for enhancing the precision of neuronavigation systems and ensuring the optimal outcome of surgical interventions. Methodology: We propose NeuralShift, a U-Net-based model that predicts brain shift entirely from pre-operative MRI for patients undergoing temporal lobe resection. We evaluated our results using Target Registration Errors (TREs) computed on anatomical landmarks located on the resection side and along the midline, and DICE scores comparing predicted intraoperative masks with masks derived from intraoperative MRI. Results: Our experimental results show that our model can predict the global deformation of the brain (DICE of 0.97) with accurate local displacements (achieve landmark TRE as low as 1.12 mm), compensating for large brain shifts during temporal lobe removal neurosurgery. Conclusion: Our proposed model is capable of predicting the global deformation of the brain during temporal lobe resection using only preoperative images, providing potential opportunities to the surgical team to increase safety and efficiency of neurosurgery and better outcomes to patients. Our contributions will be publicly available after acceptance in this https URL.

12. 【2602.03782】QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization

链接https://arxiv.org/abs/2602.03782

作者:Yuhao Xu,Yantai Yang,Zhenyang Fan,Yufan Liu,Yuming Li,Bing Li,Zhipeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:immense computational demands, computational demands critically, demands critically hinder, critically hinder deployment, resource-constrained robotic platforms

备注: ICLR2026

点击查看摘要

Abstract:The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model's quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. In the LIBERO, the quantization version of OpenVLA-OFT with our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.

13. 【2602.03766】FOVI: A biologically-inspired foveated interface for deep vision models

链接https://arxiv.org/abs/2602.03766

作者:Nicholas M. Blauch,George A. Alvarez,Talia Konkle

类目:Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)

关键词:variable resolution peaking, allowing eye-movements, eye-movements to bring, variable resolution, resolution peaking

备注

点击查看摘要

Abstract:Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex, that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code and pre-trained models are available at this https URL and this https URL.

14. 【2602.03760】RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images

链接https://arxiv.org/abs/2602.03760

作者:Mishal Fatima,Shashank Agnihotri,Kanchana Vaishnavi Gandikota,Michael Moeller,Margret Keuper

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ISP pipelines optimized, RGB images processed, discard sensor-level information, trained on RGB, ISP pipelines

备注: *Equal Contribution

点击查看摘要

Abstract:Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions. In addition, we provide object-level descriptions derived from the corresponding high-resolution sRGB images, facilitating the study of object-level information preservation under RAW image processing and low-bit quantization. The dataset allows evaluation under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints, and provides a benchmark for studying detection performance, description quality detail, and generalization in low-bit RAW image processing. Dataset code upon acceptance.

15. 【2602.03753】st-Time Conditioning with Representation-Aligned Visual Features

链接https://arxiv.org/abs/2602.03753

作者:Nicolas Sereyjol-Garros,Ellington Kirby,Victor Letzelter,Victor Besnier,Nermin Samet

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diffusion model training, remains largely unexplored, improve diffusion model, enhancing inference-time conditioning, inference-time conditioning remains

备注

点击查看摘要

Abstract:While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with rich semantic properties, to enable test-time conditioning from features in generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at this https URL.

16. 【2602.03750】Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives

链接https://arxiv.org/abs/2602.03750

作者:Owen Dong,Lily Gao,Manish Kota,Bennett A. Landmana,Jelena Bekvalac,Gaynor Western,Katherine D. Van Schaik

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:millennial scale patterns, modern imaging technologies, anthropological remains, offers new windows, human health

备注

点击查看摘要

Abstract:Paleoradiology, the use of modern imaging technologies to study archaeological and anthropological remains, offers new windows on millennial scale patterns of human health. Unfortunately, the radiographs collected during field campaigns are heterogeneous: bones are disarticulated, positioning is ad hoc, and laterality markers are often absent. Additionally, factors such as age at death, age of bone, sex, and imaging equipment introduce high variability. Thus, content navigation, such as identifying a subset of images with a specific projection view, can be time consuming and difficult, making efficient triaging a bottleneck for expert analysis. We report a zero shot prompting strategy that leverages a state of the art Large Vision Language Model (LVLM) to automatically identify the main bone, projection view, and laterality in such images. Our pipeline converts raw DICOM files to bone windowed PNGs, submits them to the LVLM with a carefully engineered prompt, and receives structured JSON outputs, which are extracted and formatted onto a spreadsheet in preparation for validation. On a random sample of 100 images reviewed by an expert board certified paleoradiologist, the system achieved 92% main bone accuracy, 80% projection view accuracy, and 100% laterality accuracy, with low or medium confidence flags for ambiguous cases. These results suggest that LVLMs can substantially accelerate code word development for large paleoradiology datasets, allowing for efficient content navigation in future anthropology workflows.

17. 【2602.03749】See-through: Single-image Layer Decomposition for Anime Characters

链接https://arxiv.org/abs/2602.03749

作者:Jian Lin,Chengze Li,Haoyun Qin,Kwun Wang Chan,Yanghua Jin,Hanyuan Liu,Stephen Chun Wang Choy,Xueting Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:static anime illustrations, framework that automates, automates the transformation, transformation of static, Part Consistency Module

备注: 23 pages, 20 figures, preprint version only

点击查看摘要

Abstract:We introduce a framework that automates the transformation of static anime illustrations into manipulatable 2.5D models. Current professional workflows require tedious manual segmentation and the artistic ``hallucination'' of occluded regions to enable motion. Our approach overcomes this by decomposing a single image into fully inpainted, semantically distinct layers with inferred drawing orders. To address the scarcity of training data, we introduce a scalable engine that bootstraps high-quality supervision from commercial Live2D models, capturing pixel-perfect semantics and hidden geometry. Our methodology couples a diffusion-based Body Part Consistency Module, which enforces global geometric coherence, with a pixel-level pseudo-depth inference mechanism. This combination resolves the intricate stratification of anime characters, e.g., interleaving hair strands, allowing for dynamic layer reconstruction. We demonstrate that our approach yields high-fidelity, manipulatable models suitable for professional, real-time animation applications.

18. 【2602.03747】LIVE: Long-horizon Interactive Video World Modeling

链接https://arxiv.org/abs/2602.03747

作者:Junchao Huang,Ziyang Ye,Xinting Hu,Tianyu He,Guiyu Zhang,Shaoshuai Shi,Jiang Bian,Li Jiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:predict future visual, future visual observations, visual observations conditioned, Autoregressive video world, models predict future

备注: 18 pages, 22 figures

点击查看摘要

Abstract:Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide an unified view that encompasses different approaches and introduce progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.

19. 【2602.03742】Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment

链接https://arxiv.org/abs/2602.03742

作者:Johny J. Lopez,Md Meftahul Ferdaus,Mahdi Abdelguerfi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:urban sustainability, sewer and culvert, critical to public, public safety, safety and urban

备注

点击查看摘要

Abstract:Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.

20. 【2602.03733】RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

链接https://arxiv.org/abs/2602.03733

作者:Wenfang Sun,Hao Chen,Yingjun Du,Yefeng Zheng,Cees G. M. Snoek

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large vision-language models, achieved remarkable progress, existing systems rely, iteratively refine understanding, multiple visual contexts

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.

21. 【2602.03673】Referring Industrial Anomaly Segmentation

链接https://arxiv.org/abs/2602.03673

作者:Pengfei Yue,Xiaokang Jiang,Yilin Lu,Jianghang Lin,Shengchuan Zhang,Liujuan Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:face significant challenges, unsupervised approaches yield, approaches yield rough, yield rough localizations, rough localizations requiring

备注

点击查看摘要

Abstract:Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the "One Anomaly Class, One Model" limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only "Anomaly" and "Background" tokens for efficient visual-textual integration. Experiments demonstrate RIAS's effectiveness in advancing IAD toward open-set capabilities. Code: this https URL.

22. 【2602.03669】Efficient Sequential Neural Network with Spatial-Temporal Attention and Linear LSTM for Robust Lane Detection Using Multi-Frame Images

链接https://arxiv.org/abs/2602.03669

作者:Sandeep Patil,Yongqi Dong,Haneen Farah,Hans Hellendoorn

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:Driver Assistance Systems, Advanced Driver Assistance, Assistance Systems, Advanced Driver, Driver Assistance

备注: 14 pages, 9 figures, under review by IEEE T-ITS

点击查看摘要

Abstract:Lane detection is a crucial perception task for all levels of automated vehicles (AVs) and Advanced Driver Assistance Systems, particularly in mixed-traffic environments where AVs must interact with human-driven vehicles (HDVs) and challenging traffic scenarios. Current methods lack versatility in delivering accurate, robust, and real-time compatible lane detection, especially vision-based methods often neglect critical regions of the image and their spatial-temporal (ST) salience, leading to poor performance in difficult circumstances such as serious occlusion and dazzle lighting. This study introduces a novel sequential neural network model with a spatial-temporal attention mechanism to focus on key features of lane lines and exploit salient ST correlations among continuous image frames. The proposed model, built on a standard encoder-decoder structure and common neural network backbones, is trained and evaluated on three large-scale open-source datasets. Extensive experiments demonstrate the strength and robustness of the proposed model, outperforming state-of-the-art methods in various testing scenarios. Furthermore, with the ST attention mechanism, the developed sequential neural network models exhibit fewer parameters and reduced Multiply-Accumulate Operations (MACs) compared to baseline sequential models, highlighting their computational efficiency. Relevant data, code, and models are released at this https URL.

23. 【2602.03668】MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

链接https://arxiv.org/abs/2602.03668

作者:Jung Min Lee,Dohyeok Lee,Seokhun Ju,Taehyun Cho,Jin Woo Koo,Li Zhao,Sangwoo Hong,Jungwoo Lee

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:scaling robot learning, embodiment-specific robot datasets, enables scaling robot, robot learning, latent actions

备注

点击查看摘要

Abstract:Learning \emph{latent actions} from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent's actions despite the absence of ground-truth labels. We propose \textbf{M}ulti-\textbf{V}iew\textbf{P}oint \textbf{L}atent \textbf{A}ction \textbf{M}odel (\textbf{MVP-LAM}), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a \emph{cross-viewpoint reconstruction} objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.

24. 【2602.03665】MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment

链接https://arxiv.org/abs/2602.03665

作者:Eunkyu Park,Wesley Hanwen Deng,Cheyon Jin,Matheus Kunzler Maldaner,Jordan Wheeler,Jason I. Hong,Hong Shen,Adam Perer,Ken Holstein,Motahhare Eslami,Gunhee Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:socially ambiguous contexts, make morally salient, morally salient judgments, Vision-Language Models, continue to struggle

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior works typically rely on binary or pairwise supervision, which often fail to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.

25. 【2602.03634】SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection

链接https://arxiv.org/abs/2602.03634

作者:Wei Zhang,Xiang Liu,Ningjing Liu,Mingxin Liu,Wei Liao,Chunyan Xu,Xue Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:oriented object detection, oriented object, object detection, maintaining comparable performance, consistent trend

备注: The Fourteenth International Conference on Learning Representations (ICLR 2026)

点击查看摘要

Abstract:A consistent trend throughout the research of oriented object detection has been the pursuit of maintaining comparable performance with fewer and weaker annotations. This is particularly crucial in the remote sensing domain, where the dense object distribution and a wide variety of categories contribute to prohibitively high costs. Based on the supervision level, existing oriented object detection algorithms can be broadly grouped into fully supervised, semi-supervised, and weakly supervised methods. Within the scope of this work, we further categorize them to include sparsely supervised and partially weakly-supervised methods. To address the challenges of large-scale labeling, we introduce the first Sparse Partial Weakly-Supervised Oriented Object Detection framework, designed to efficiently leverage only a few sparse weakly-labeled data and plenty of unlabeled data. Our framework incorporates three key innovations: (1) We design a Sparse-annotation-Orientation-and-Scale-aware Student (SOS-Student) model to separate unlabeled objects from the background in a sparsely-labeled setting, and learn orientation and scale information from orientation-agnostic or scale-agnostic weak annotations. (2) We construct a novel Multi-level Pseudo-label Filtering strategy that leverages the distribution of model predictions, which is informed by the model's multi-layer predictions. (3) We propose a unique sparse partitioning approach, ensuring equal treatment for each category. Extensive experiments on the DOTA and DIOR datasets show that our framework achieves a significant performance gain over traditional oriented object detection methods mentioned above, offering a highly cost-effective solution. Our code is publicly available at this https URL.

26. 【2602.03625】Multi-Objective Optimization for Synthetic-to-Real Style Transfer

链接https://arxiv.org/abs/2602.03625

作者:Estelle Chigot,Thomas Oberlin,Manon Huguenin,Dennis Wilson

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Semantic segmentation networks, pixel-level annotated data, Semantic segmentation, amounts of pixel-level, pixel-level annotated

备注: Accepted in International Conference on the Applications of Evolutionary Computation (Part of EvoStar), April 2026 (EvoApplications 2026)

点击查看摘要

Abstract:Semantic segmentation networks require large amounts of pixel-level annotated data, which are costly to obtain for real-world images. Computer graphics engines can generate synthetic images alongside their ground-truth annotations. However, models trained on such images can perform poorly on real images due to the domain gap between real and synthetic images. Style transfer methods can reduce this difference by applying a realistic style to synthetic images. Choosing effective data transformations and their sequence is difficult due to the large combinatorial search space of style transfer operators. Using multi-objective genetic algorithms, we optimize pipelines to balance structural coherence and style similarity to target domains. We study the use of paired-image metrics on individual image samples during evolution to enable rapid pipeline evaluation, as opposed to standard distributional metrics that require the generation of many images. After optimization, we evaluate the resulting Pareto front using distributional metrics and segmentation performance. We apply this approach to standard datasets in synthetic-to-real domain adaptation: from the video game GTA5 to real image datasets Cityscapes and ACDC, focusing on adverse conditions. Results demonstrate that evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives. The contribution of this work is the formulation of style transfer as a sequencing problem suitable for evolutionary optimization and the study of efficient metrics that enable feasible search in this space. The source code is available at: this https URL.

27. 【2602.03622】Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis

链接https://arxiv.org/abs/2602.03622

作者:Lu Zhang,Huizhen Yu,Zuowei Wang,Fu Gui,Yatu Guo,Wei Zhang,Mengyu Jia

类目:Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

关键词:Retinal diseases spanning, spanning a broad, broad spectrum, effectively identified, identified and diagnosed

备注

点击查看摘要

Abstract:Retinal diseases spanning a broad spectrum can be effectively identified and diagnosed using complementary signals from multimodal data. However, multimodal diagnosis in ophthalmic practice is typically challenged in terms of data heterogeneity, potential invasiveness, registration complexity, and so on. As such, a unified framework that integrates multimodal data synthesis and fusion is proposed for retinal disease classification and grading. Specifically, the synthesized multimodal data incorporates fundus fluorescein angiography (FFA), multispectral imaging (MSI), and saliency maps that emphasize latent lesions as well as optic disc/cup regions. Parallel models are independently trained to learn modality-specific representations that capture cross-pathophysiological signatures. These features are then adaptively calibrated within and across modalities to perform information pruning and flexible integration according to downstream tasks. The proposed learning system is thoroughly interpreted through visualizations in both image and feature spaces. Extensive experiments on two public datasets demonstrated the superiority of our approach over state-of-the-art ones in the tasks of multi-label classification (F1-score: 0.683, AUC: 0.953) and diabetic retinopathy grading (Accuracy:0.842, Kappa: 0.861). This work not only enhances the accuracy and efficiency of retinal disease screening but also offers a scalable framework for data augmentation across various medical imaging modalities.

28. 【2602.03615】KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs

链接https://arxiv.org/abs/2602.03615

作者:Baiyang Song,Jun Peng,Yuxin Zhang,Guangyao Chen,Feidiao Yang,Jianyuan Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:vision language models, costly video-specific training, pre-trained vision language, strong image comprehension, image comprehension capabilities

备注

点击查看摘要

Abstract:Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose \textbf{KTV}, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, \emph{e.g.}, only 504 visual tokens for a 60-min video with 10800 frames, achieving $44.8\%$ accuracy on the MLVU-Test benchmark. In particular, KTV also exceeds several training-based approaches on certain benchmarks.

29. 【2602.03604】A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

链接https://arxiv.org/abs/2602.03604

作者:Basile Terver,Randall Balestriero,Megi Dervishi,David Fan,Quentin Garrido,Tushar Nagarajan,Koustuv Sinha,Wancong Zhang,Mike Rabbat,Yann LeCun,Amir Bar

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Joint-Embedding Predictive Architectures, Predictive Architectures, present EB-JEPA, Joint-Embedding Predictive, EB-JEPA

备注

点击查看摘要

Abstract:We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at this https URL.

30. 【2602.03595】Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

链接https://arxiv.org/abs/2602.03595

作者:Haichao Jiang,Tianming Liang,Wei-Shi Zheng,Jian-Fang Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring Video Object, Video Object Segmentation, Referring Video, Object Segmentation, Video Object

备注

点击查看摘要

Abstract:Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose \textbf{Refer-Agent}, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generates feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.

31. 【2602.03594】IPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

链接https://arxiv.org/abs/2602.03594

作者:Alireza Salehi,Ehsan Karami,Sepehr Noey,Sahand Noey,Makoto Yamada,Reshad Hosseini,Mohammad Sabokrou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Anomaly detection identifies, detection identifies departures, safety-critical settings, zero-shot anomaly detection, identifies departures

备注: This is the extended version of the paper accepted in ICASSP'26, which will be publicly available in May. Authors' contributions may vary among the versions

点击查看摘要

Abstract:Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS-a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts-fixed for image-level detection and learnable for pixel-level localization-and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at this http URL.

32. 【2602.03591】High-Resolution Underwater Camouflaged Object Detection: GBU-UCOD Dataset and Topology-Aware and Frequency-Decoupled Networks

链接https://arxiv.org/abs/2602.03591

作者:Wenji Wu,Shuo Ye,Yiyu Liu,Jiguang He,Zhuo Wang,Zitong Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Camouflaged Object Detection, Underwater Camouflaged Object, challenging task due, extreme visual similarity, Object Detection

备注

点击查看摘要

Abstract:Underwater Camouflaged Object Detection (UCOD) is a challenging task due to the extreme visual similarity between targets and backgrounds across varying marine depths. Existing methods often struggle with topological fragmentation of slender creatures in the deep sea and the subtle feature extraction of transparent organisms. In this paper, we propose DeepTopo-Net, a novel framework that integrates topology-aware modeling with frequency-decoupled perception. To address physical degradation, we design the Water-Conditioned Adaptive Perceptor (WCAP), which employs Riemannian metric tensors to dynamically deform convolutional sampling fields. Furthermore, the Abyssal-Topology Refinement Module (ATRM) is developed to maintain the structural connectivity of spindly targets through skeletal priors. Specifically, we first introduce GBU-UCOD, the first high-resolution (2K) benchmark tailored for marine vertical zonation, filling the data gap for hadal and abyssal zones. Extensive experiments on MAS3K, RMAS, and our proposed GBU-UCOD datasets demonstrate that DeepTopo-Net achieves state-of-the-art performance, particularly in preserving the morphological integrity of complex underwater patterns. The datasets and codes will be released at this https URL.

33. 【2602.03589】SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

链接https://arxiv.org/abs/2602.03589

作者:Ming Nie,Dan Ding,Chunwei Wang,Yuanfan Guo,Jianhua Han,Hang Xu,Li Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large language models, Large language, analyze video data, demonstrated exceptional capabilities, language models

备注: NeurIPS 2024

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.

34. 【2602.03558】ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

链接https://arxiv.org/abs/2602.03558

作者:Xinyue Li,Zhiming Xu,Zhichao Zhang,Zhaolin Cai,Sijing Wu,Xiongkuo Min,Yitong Chen,Guangtao Zhai

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:rendering previously collected, previously collected labels, collected labels unreliable, perceptual quality ceiling, unprecedented pace

备注

点击查看摘要

Abstract:Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.

35. 【2602.03555】Cut to the Mix: Simple Data Augmentation Outperforms Elaborate Ones in Limited Organ Segmentation Datasets

链接https://arxiv.org/abs/2602.03555

作者:Chang Liu,Fuxin Fan,Annette Schwarz,Andreas Maier

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:widely applied clinical, applied clinical routine, segmentation tools dramatically, tools dramatically improve, widely applied

备注: Accepted at MICCAI 2024

点击查看摘要

Abstract:Multi-organ segmentation is a widely applied clinical routine and automated organ segmentation tools dramatically improve the pipeline of the radiologists. Recently, deep learning (DL) based segmentation models have shown the capacity to accomplish such a task. However, the training of the segmentation networks requires large amount of data with manual annotations, which is a major concern due to the data scarcity from clinic. Working with limited data is still common for researches on novel imaging modalities. To enhance the effectiveness of DL models trained with limited data, data augmentation (DA) is a crucial regularization technique. Traditional DA (TDA) strategies focus on basic intra-image operations, i.e. generating images with different orientations and intensity distributions. In contrast, the interimage and object-level DA operations are able to create new images from separate individuals. However, such DA strategies are not well explored on the task of multi-organ segmentation. In this paper, we investigated four possible inter-image DA strategies: CutMix, CarveMix, ObjectAug and AnatoMix, on two organ segmentation datasets. The result shows that CutMix, CarveMix and AnatoMix can improve the average dice score by 4.9, 2.0 and 1.9, compared with the state-of-the-art nnUNet without DA strategies. These results can be further improved by adding TDA strategies. It is revealed in our experiments that Cut-Mix is a robust but simple DA strategy to drive up the segmentation performance for multi-organ segmentation, even when CutMix produces intuitively 'wrong' images. Our implementation is publicly available for future benchmarks.

36. 【2602.03547】AffordanceGrasp-R1:Leveraging Reasoning-Based Affordance Segmentation with Reinforcement Learning for Robotic Grasping

链接https://arxiv.org/abs/2602.03547

作者:Dingyi Zhou,Mu He,Zhuowei Fang,Xiangtong Yao,Yinlong Liu,Alois Knoll,Hu Cao

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:reasoning-driven affordance segmentation, affordance segmentation framework, cold-start strategy, spatial grounding, segmentation framework

备注: Preprint version

点击查看摘要

Abstract:We introduce AffordanceGrasp-R1, a reasoning-driven affordance segmentation framework for robotic grasping that combines a chain-of-thought (CoT) cold-start strategy with reinforcement learning to enhance deduction and spatial grounding. In addition, we redesign the grasping pipeline to be more context-aware by generating grasp candidates from the global scene point cloud and subsequently filtering them using instruction-conditioned affordance masks. Extensive experiments demonstrate that AffordanceGrasp-R1 consistently outperforms state-of-the-art (SOTA) methods on benchmark datasets, and real-world robotic grasping evaluations further validate its robustness and generalization under complex language-conditioned manipulation scenarios.

37. 【2602.03538】Constrained Dynamic Gaussian Splatting

链接https://arxiv.org/abs/2602.03538

作者:Zihan Zheng,Zhenglong Wu,Xuanxuan Wang,Houqiang Zhong,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai,Wenjun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Dynamic Gaussian Splatting, Gaussian Splatting enables, Splatting enables high-fidelity, Constrained Dynamic Gaussian, unconstrained densification leads

备注

点击查看摘要

Abstract:While Dynamic Gaussian Splatting enables high-fidelity 4D reconstruction, its deployment is severely hindered by a fundamental dilemma: unconstrained densification leads to excessive memory consumption incompatible with edge devices, whereas heuristic pruning fails to achieve optimal rendering quality under preset Gaussian budgets. In this work, we propose Constrained Dynamic Gaussian Splatting (CDGS), a novel framework that formulates dynamic scene reconstruction as a budget-constrained optimization problem to enforce a strict, user-defined Gaussian budget during training. Our key insight is to introduce a differentiable budget controller as the core optimization driver. Guided by a multi-modal unified importance score, this controller fuses geometric, motion, and perceptual cues for precise capacity regulation. To maximize the utility of this fixed budget, we further decouple the optimization of static and dynamic elements, employing an adaptive allocation mechanism that dynamically distributes capacity based on motion complexity. Furthermore, we implement a three-phase training strategy to seamlessly integrate these constraints, ensuring precise adherence to the target count. Coupled with a dual-mode hybrid compression scheme, CDGS not only strictly adheres to hardware constraints (error 2%}) but also pushes the Pareto frontier of rate-distortion performance. Extensive experiments demonstrate that CDGS delivers optimal rendering quality under varying capacity limits, achieving over 3x compression compared to state-of-the-art methods.

38. 【2602.03533】PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation

链接https://arxiv.org/abs/2602.03533

作者:Yongwei Chen,Tianyi Wei,Yushi Lan,Zhaoyang Lyu,Shangchen Zhou,Xudong Xu,Xingang Pan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:rapid progress, inspired efforts, large multimodal models, models, understanding

备注: Yongwei Chen and Tianyi Wei contributed equally. Project page: [this https URL](https://cyw-3d.github.io/PnP-U3D/)

点击查看摘要

Abstract:The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding, and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.

39. 【2602.03531】Robust Representation Learning in Masked Autoencoders

链接https://arxiv.org/abs/2602.03531

作者:Anika Shrivastava,Renu Rameshan,Samar Agnihotri

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Masked Autoencoders, achieve impressive performance, image classification tasks, achieve impressive, remain less understood

备注: 11 pages, 8 figures, and 3 tables

点击查看摘要

Abstract:Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In this process we discover that representations learned with the pretraining and fine-tuning, are quite robust - demonstrating a good classification performance in the presence of degradations, such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify feature robustness, we introduce two sensitivity indicators: directional alignment between clean and perturbed embeddings, and head-wise retention of active features under degradations. These studies help establish the robust classification performance of MAEs.

40. 【2602.03530】Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning

链接https://arxiv.org/abs/2602.03530

作者:Xufei Zhang,Xinjiao Zhou,Ziling Deng,Dongdong Geng,Jianxiong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:spatial layout, Logical Anomaly Classification, object quantity, compositional relationships, Logical Anomaly

备注: 6 pages, 6 figures

点击查看摘要

Abstract:Logical anomalies are violations of predefined constraints on object quantity, spatial layout, and compositional relationships in industrial images. While prior work largely treats anomaly detection as a binary decision, such formulations cannot indicate which logical rule is broken and therefore offer limited value for quality assurance. We introduce Logical Anomaly Classification (LAC), a task that unifies anomaly detection and fine-grained violation classification in a single inference step. To tackle LAC, we propose LogiCls, a vision-language framework that decomposes complex logical constraints into a sequence of verifiable subqueries. We further present a data-centric instruction synthesis pipeline that generates chain-of-thought (CoT) supervision for these subqueries, coupling precise grounding annotations with diverse image-text augmentations to adapt vision language models (VLMs) to logic-sensitive reasoning. Training is stabilized by a difficulty-aware resampling strategy that emphasizes challenging subqueries and long tail constraint types. Extensive experiments demonstrate that LogiCls delivers robust, interpretable, and accurate industrial logical anomaly classification, providing both the predicted violation categories and their evidence trails.

41. 【2602.03510】Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

链接https://arxiv.org/abs/2602.03510

作者:Bozhou Li,Yushuo Guan,Haolin Li,Bohan Zeng,Yiyan Ji,Yue Ding,Pengfei Wan,Kun Gai,Yuanxing Zhang,Wentao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:single LLM layer, increasingly adopt LLMs, remains largely static, pronounced semantic hierarchy, models increasingly adopt

备注

点击查看摘要

Abstract:Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network depth. To better match the dynamic process of DiT generation and thereby enhance the diffusion model's generative capability, we introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states via time-wise, depth-wise, and joint fusion. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy, consistently improving text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task). Conversely, we find that purely time-wise fusion can paradoxically degrade visual generation fidelity. We attribute this to a train-inference trajectory mismatch: under classifier-free guidance, nominal timesteps fail to track the effective SNR, causing semantically mistimed feature injection during inference. Overall, our results position depth-wise routing as a strong and effective baseline and highlight the critical need for trajectory-aware signals to enable robust time-dependent conditioning.

42. 【2602.03491】Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

链接https://arxiv.org/abs/2602.03491

作者:Yingjie Zhu,Xuefeng Bai,Kehai Chen,Yang Xiang,Youcheng Pan,Xiaoqiang Zhou,Min Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large Vision-Language Models, images remains challenging, Vision-Language Models, challenging for Large, Large Vision-Language

备注

点击查看摘要

Abstract:Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to tables structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLM's table understanding and reasoning capabilities, particularly generalizing to unseen table structures.

43. 【2602.03473】Scaling Continual Learning with Bi-Level Routing Mixture-of-Experts

链接https://arxiv.org/abs/2602.03473

作者:Meng Lou,Yunxiang Fu,Yizhou Yu

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:garnered substantial research, substantial research interest, class-incremental learning, pre-trained model, recent years

备注

点击查看摘要

Abstract:Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable {C}ontinual Le{a}rner with efficient Bi-Level {R}outing Mixture-of-{E}xperts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging evaluation protocol for comprehensively assessing CIL methods across very long task sequences spanning hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. Code will be publicly released at this https URL.

44. 【2602.03472】Inlier-Centric Post-Training Quantization for Object Detection Models

链接https://arxiv.org/abs/2602.03472

作者:Minsu Kim,Dongyeun Lee,Jaemyung Yu,Jiwan Hur,Giseop Kim,Junmo Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:immense computational demands, computational demands make, demands make deployment, make deployment slow, computer vision

备注

点击查看摘要

Abstract:Object detection is pivotal in computer vision, yet its immense computational demands make deployment slow and power-hungry, motivating quantization. However, task-irrelevant morphologies such as background clutter and sensor noise induce redundant activations (or anomalies). These anomalies expand activation ranges and skew activation distributions toward task-irrelevant responses, complicating bit allocation and weakening the preservation of informative features. Without a clear criterion to distinguish anomalies, suppressing them can inadvertently discard useful information. To address this, we present InlierQ, an inlier-centric post-training quantization approach that separates anomalies from informative inliers. InlierQ computes gradient-aware volume saliency scores, classifies each volume as an inlier or anomaly, and fits a posterior distribution over these scores using the Expectation-Maximization (EM) algorithm. This design suppresses anomalies while preserving informative features. InlierQ is label-free, drop-in, and requires only 64 calibration samples. Experiments on the COCO and nuScenes benchmarks show consistent reductions in quantization error for camera-based (2D and 3D) and LiDAR-based (3D) object detection.

45. 【2602.03454】Contextualized Visual Personalization in Vision-Language Models

链接https://arxiv.org/abs/2602.03454

作者:Yeongtak Oh,Sangwon Yu,Junsung Park,Han Cheol Moon,Jisoo Mok,Sungroh Yoon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:user accumulated visual-textual, user specific experiences, accumulated visual-textual context, generate personalized responses, personalized responses based

备注: Project Page: [this https URL](https://github.com/oyt9306/CoViP)

点击查看摘要

Abstract:Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

46. 【2602.03448】Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

链接https://arxiv.org/abs/2602.03448

作者:Yijia Xu,Zihao Wang,Jinshi Cui

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:image generation aims, textual instructions, aims to synthesize, faithfully preserve, preserve the identities

备注

点击查看摘要

Abstract:Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the multi-subject image generation, substantially improving prompt following and subject consistency.

47. 【2602.03447】HetroD: A High-Fidelity Drone Dataset and Benchmark for Autonomous Driving in Heterogeneous Traffic

链接https://arxiv.org/abs/2602.03447

作者:Yu-Hsiang Chen,Wei-Jer Chang,Christian Kotulla,Thomas Keutgens,Steffen Runde,Tobias Moers,Christoph Klas,Wei Zhan,Masayoshi Tomizuka,Yi-Ting Chen

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:developing autonomous driving, autonomous driving systems, driving systems, heterogeneous environments, present HetroD

备注: IEEE International Conference on Robotics and Automation (ICRA) 2026

点击查看摘要

Abstract:We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navi- gating real-world heterogeneous traffic dominated by vulner- able road users (VRUs), including pedestrians, cyclists, and motorcyclists that interact with vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right-of-way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane-disciplined traffic. To bridge the gap, we collect a large- scale drone-based dataset to provide a holistic observation of traffic scenes with centimeter-accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per-agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high- fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, het- erogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state-of-the-art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi-agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: this https URL

48. 【2602.03425】ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning

链接https://arxiv.org/abs/2602.03425

作者:Xiaofeng Tan,Jun Liu,Yuanting Fan,Bin-Bin Gao,Xi Jiang,Xiaochen Chen,Jinlong Peng,Chengjie Wang,Hongsong Wang,Feng Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reinforcement Fine-Tuning, preference alignment, crucial for preference, Reinforcement, visual hallucinations

备注

点击查看摘要

Abstract:Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, they often introduce visual hallucinations like over-optimized details and semantic misalignment. This work preliminarily explores why visual hallucinations arise and how to reduce them. We first investigate RFT methods from a unified perspective, and reveal the core problems stemming from two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, leading to an over-emphasis on local details at the expense of global semantics, and (2) trajectory imitation process inherent in policy gradient methods, distorting the model's foundational vector field and its cross-step consistency. Building on this, we propose ConsistentRFT, a general framework to mitigate these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism to balance exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce a Consistent Policy Gradient Optimization (CPGO) that preserves the model's consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49\% for low-level and 38\% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, showing an improvement of 5.1\% (v.s. the baseline's decrease of -0.4\%) over this http URL. This is \href{this https URL}{Project Page}.

49. 【2602.03423】Origin Lens: A Privacy-First Mobile Framework for Cryptographic Image Provenance and AI Detection

链接https://arxiv.org/abs/2602.03423

作者:Alexander Loth,Dominique Conceicao Rosario,Peter Ebinger,Martin Kappes,Marc-Oliver Pahl

类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:information integrity assurance, connect model governance, Origin Lens, integrity assurance, present Origin Lens

备注: Accepted at ACM TheWebConf '26 Companion

点击查看摘要

Abstract:The proliferation of generative AI poses challenges for information integrity assurance, requiring systems that connect model governance with end-user verification. We present Origin Lens, a privacy-first mobile framework that targets visual disinformation through a layered verification architecture. Unlike server-side detection systems, Origin Lens performs cryptographic image provenance verification and AI detection locally on the device via a Rust/Flutter hybrid architecture. Our system integrates multiple signals - including cryptographic provenance, generative model fingerprints, and optional retrieval-augmented verification - to provide users with graded confidence indicators at the point of consumption. We discuss the framework's alignment with regulatory requirements (EU AI Act, DSA) and its role in verification infrastructure that complements platform-level mechanisms.

50. 【2602.03414】Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

链接https://arxiv.org/abs/2602.03414

作者:Zhengbo Jiao,Shaobo Wang,Zifan Zhang,Wei Wang,Bing Zhao,Hu Wei,Linfeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, advanced vision-language understanding

备注: 18pages

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher's targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).

51. 【2602.03410】UnHype: CLIP-Guided Hypernetworks for Dynamic LoRA Unlearning

链接https://arxiv.org/abs/2602.03410

作者:Piotr Wójcik,Maksym Petrenko,Wojciech Gromski,Przemysław Spurek,Maciej Zieba

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, socially disruptive content, potential misuse, advances in large-scale, intensified concerns

备注

点击查看摘要

Abstract:Recent advances in large-scale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility. Repository: this https URL.

52. 【2602.03390】From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning

链接https://arxiv.org/abs/2602.03390

作者:Hyun Seok Seong,WonJun Moon,Jae-Pil Heo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:decomposing complex scenes, shown great promise, Unsupervised object-centric learning, blurry reconstruction maps, object-centric learning models

备注: ICLR 2026; Code is available at [this https URL](https://github.com/hynnsk/SRL)

点击查看摘要

Abstract:Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks high-frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL) that establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder's sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder's spatial consistency to denoise the encoder's features. This mutual refinement process is stabilized by a warm-up phase with a slot regularization objective that initially allocates distinct entities per slot. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks. Codes are available at this https URL.

53. 【2602.03380】Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization

链接https://arxiv.org/abs/2602.03380

作者:Hao Fang,Jinyu Li,Jiawei Kong,Tianqu Zhuang,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibited impressive capabilities, textbf, impressive capabilities, exhibited impressive, remain prone

备注

点击查看摘要

Abstract:While multimodal reasoning models (MLRMs) have exhibited impressive capabilities, they remain prone to hallucinations, and effective solutions are still underexplored. In this paper, we experimentally analyze the hallucination cause and propose C3PO, a training-based mitigation framework comprising \textbf{C}hain-of-Thought \textbf{C}ompression and \textbf{C}ontrastive \textbf{P}reference \textbf{O}ptimization. Firstly, we identify that introducing reasoning mechanisms exacerbates models' reliance on language priors while overlooking visual inputs, which can produce CoTs with reduced visual cues but redundant text tokens. To this end, we propose to selectively filter redundant thinking tokens for a more compact and signal-efficient CoT representation that preserves task-relevant information while suppressing noise. In addition, we observe that the quality of the reasoning trace largely determines whether hallucination emerges in subsequent responses. To leverage this insight, we introduce a reasoning-enhanced preference tuning scheme that constructs training pairs using high-quality AI feedback. We further design a multimodal hallucination-inducing mechanism that elicits models' inherent hallucination patterns via carefully crafted inducers, yielding informative negative signals for contrastive correction. We provide theoretical justification for the effectiveness and demonstrate consistent hallucination reduction across diverse MLRMs and benchmarks.

54. 【2602.03376】PlanTRansformer: Unified Prediction and Planning with Goal-conditioned Transformer

链接https://arxiv.org/abs/2602.03376

作者:Constantin Selzer,Fabina B. Flohr

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Trajectory prediction, autonomous driving, components in autonomous, Trajectory, prediction

备注: Submitted and accepted at IEEE IV 2026

点击查看摘要

Abstract:Trajectory prediction and planning are fundamental yet disconnected components in autonomous driving. Prediction models forecast surrounding agent motion under unknown intentions, producing multimodal distributions, while planning assumes known ego objectives and generates deterministic trajectories. This mismatch creates a critical bottleneck: prediction lacks supervision for agent intentions, while planning requires this information. Existing prediction models, despite strong benchmarking performance, often remain disconnected from planning constraints such as collision avoidance and dynamic feasibility. We introduce Plan TRansformer (PTR), a unified Gaussian Mixture Transformer framework integrating goal-conditioned prediction, dynamic feasibility, interaction awareness, and lane-level topology reasoning. A teacher-student training strategy progressively masks surrounding agent commands during training to align with inference conditions where agent intentions are unavailable. PTR achieves 4.3%/3.5% improvement in marginal/joint mAP compared to the baseline Motion Transformer (MTR) and 15.5% planning error reduction at 5s horizon compared to GameFormer. The architecture-agnostic design enables application to diverse Transformer-based prediction models. Project Website: this https URL

55. 【2602.03373】Unifying Watermarking via Dimension-Aware Mapping

链接https://arxiv.org/abs/2602.03373

作者:Jiale Meng,Runyi Hu,Jie Zhang,Zheming Lu,Ivor Tsang,Tianwei Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:similar encoder-decoder architectures, share similar encoder-decoder, Deep watermarking methods, Deep watermarking, encoder-decoder architectures

备注: 29 pages, 25 figures

点击查看摘要

Abstract:Deep watermarking methods often share similar encoder-decoder architectures, yet differ substantially in their functional behaviors. We propose DiM, a new multi-dimensional watermarking framework that formulates watermarking as a dimension-aware mapping problem, thereby unifying existing watermarking methods at the functional level. Under DiM, watermark information is modeled as payloads of different dimensionalities, including one-dimensional binary messages, two-dimensional spatial masks, and three-dimensional spatiotemporal structures. We find that the dimensional configuration of embedding and extraction largely determines the resulting watermarking behavior. Same-dimensional mappings preserve payload structure and support fine-grained control, while cross-dimensional mappings enable spatial or spatiotemporal localization. We instantiate DiM in the video domain, where spatiotemporal representations enable a broader set of dimension mappings. Experiments demonstrate that varying only the embedding and extraction dimensions, without architectural changes, leads to different watermarking capabilities, including spatiotemporal tamper localization, local embedding control, and recovery of temporal order under frame disruptions.

56. 【2602.03372】SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI

链接https://arxiv.org/abs/2602.03372

作者:Mario Pascual-González,Ariadna Jiménez-Partinen,R.M. Luque-Baena,Fátima Nagib-Raya,Ezequiel López-Rubio

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:epilepsy FLAIR MRI, Focal cortical dysplasia, generative modeling prone, FLAIR MRI, mask generative modeling

备注: 6 pages, 2 figures, 1 table, conference paper

点击查看摘要

Abstract:Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image--mask generative modeling prone to instability and memorization. We propose SLIM-Diff, a compact joint diffusion model whose main contributions are (i) a single shared-bottleneck U-Net that enforces tight coupling between anatomy and lesion geometry from a 2-channel image+mask representation, and (ii) loss-geometry tuning via a tunable $L_p$ objective. As an internal baseline, we include the canonical DDPM-style objective ($\epsilon$-prediction with $L_2$ loss) and isolate the effect of prediction parameterization and $L_p$ geometry under a matched setup. Experiments show that $x_0$-prediction is consistently the strongest choice for joint synthesis, and that fractional sub-quadratic penalties ($L_{1.5}$) improve image fidelity while $L_2$ better preserves lesion mask morphology. Our code and model weights are available in this https URL

57. 【2602.03371】Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion

链接https://arxiv.org/abs/2602.03371

作者:Zhiwen Yang,Yuxin Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving systems, voxel-level scene perception, scene perception foundation, semantic scene completion, autonomous driving

备注: 15 pages, 6 figures, accepted by TIP 2026

点击查看摘要

Abstract:Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for the perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization rely solely on the supervision from voxel labels and face the challenge of voxel sparsity as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a \textit{Multi-Resolution Alignment (MRA)} approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits the scene and instance level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. The code is available at this https URL.

58. 【2602.03370】Symbol-Aware Reasoning with Masked Discrete Diffusion for Handwritten Mathematical Expression Recognition

链接https://arxiv.org/abs/2602.03370

作者:Takaya Kawakatsu,Ryo Ishiyama

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Handwritten Mathematical Expression, Mathematical Expression Recognition, autoregressive models struggle, Handwritten Mathematical, Mathematical Expression

备注

点击查看摘要

Abstract:Handwritten Mathematical Expression Recognition (HMER) requires reasoning over diverse symbols and 2D structural layouts, yet autoregressive models struggle with exposure bias and syntactic inconsistency. We present a discrete diffusion framework that reformulates HMER as iterative symbolic refinement instead of sequential generation. Through multi-step remasking, the proposal progressively refines both symbols and structural relations, removing causal dependencies and improving structural consistency. A symbol-aware tokenization and Random-Masking Mutual Learning further enhance syntactic alignment and robustness to handwriting diversity. On the MathWriting benchmark, the proposal achieves 5.56\% CER and 60.42\% EM, outperforming strong Transformer and commercial baselines. Consistent gains on CROHME 2014--2023 demonstrate that discrete diffusion provides a new paradigm for structure-aware visual recognition beyond generative modeling.

59. 【2602.03361】Z3D: Zero-Shot 3D Visual Grounding from Images

链接https://arxiv.org/abs/2602.03361

作者:Nikita Drozdov,Andrey Lemeshko,Nikita Gavrilov,Anton Konushin,Danila Rukhovich,Maksim Kolodiazhnyi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:natural language queries, aims to localize, scene based, language queries, based on natural

备注

点击查看摘要

Abstract:3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods causing significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at this https URL .

60. 【2602.03342】d Prompts: Overcoming Prompt Underspecification in Image and Video Super-Resolution

链接https://arxiv.org/abs/2602.03342

作者:Bryan Sangwoo Kim,Jonghyun Park,Jong Chul Ye

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:pipelines typically rely, Text-conditioned diffusion models, single global caption, modern super-resolution pipelines, super-resolution pipelines typically

备注: 13 pages, 7 figures

点击查看摘要

Abstract:Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, but modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions, where a single global caption causes prompt underspecification. A coarse global prompt often misses localized details (prompt sparsity) and provides locally irrelevant guidance (prompt misguidance) that can be amplified by classifier-free guidance. We propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance that resolves prompt underspecification with minimal overhead. Experiments on high resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts relative to global-prompt baselines.

61. 【2602.03339】Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability

链接https://arxiv.org/abs/2602.03339

作者:Bingchen Zhao,Qiushan Guo,Ye Wang,Yixuan Huang,Zhonghua Zhai,Yu Tian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:learning visual tokenizers, training framework, framework for learning, learning visual, tokens

备注

点击查看摘要

Abstract:We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. By employing an InfoGAN-style objective, where we train a recognition model to predict the tokens used to condition the diffusion decoder using the decoded images, we enforce the decoder to not ignore any of the tokens. To promote compositional control, besides the original images, CompTok also trains on tokens formed by swapping token subsets between images, enabling more compositional control of the token over the decoder. As the swapped tokens between images do not have ground truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on image class-conditioned generation, but also demonstrates properties such as swapping tokens between images to achieve high level semantic editing of an image. Additionally, we propose two metrics that measures the landscape of the token space that can be useful to describe not only the compositionality of the tokens, but also how easy to learn the landscape is for a generator to be trained on this space. We show in experiments that CompTok can improve on both of the metrics as well as supporting state-of-the-art generators for class conditioned generation.

62. 【2602.03333】PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets

链接https://arxiv.org/abs/2602.03333

作者:Haoran Li,Renyang Liu,Hongjia Liu,Chen Wang,Long Yin,Jian Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high attack performance, presents significant challenges, achieving spatial imperceptibility, Recent progress, attack performance

备注: Accepted by WWW 2026

点击查看摘要

Abstract:Recent progress in adversarial attacks on 3D point clouds, particularly in achieving spatial imperceptibility and high attack performance, presents significant challenges for defenders. Current defensive approaches remain cumbersome, often requiring invasive model modifications, expensive training procedures or auxiliary data access. To address these threats, in this paper, we propose a plug-and-play and non-invasive defense mechanism in the spectral domain, grounded in a theoretical and empirical analysis of the relationship between imperceptible perturbations and high-frequency spectral components. Building upon these insights, we introduce a novel purification framework, termed PWAVEP, which begins by computing a spectral graph wavelet domain saliency score and local sparsity score for each point. Guided by these values, PWAVEP adopts a hierarchical strategy, it eliminates the most salient points, which are identified as hardly recoverable adversarial outliers. Simultaneously, it applies a spectral filtering process to a broader set of moderately salient points. This process leverages a graph wavelet transform to attenuate high-frequency coefficients associated with the targeted points, thereby effectively suppressing adversarial noise. Extensive evaluations demonstrate that the proposed PWAVEP achieves superior accuracy and robustness compared to existing approaches, advancing the state-of-the-art in 3D point cloud purification. Code and datasets are available at this https URL

63. 【2602.03327】Pi-GS: Sparse-View Gaussian Splatting with Dense π^3 Initialization

链接https://arxiv.org/abs/2602.03327

作者:Manuel Hofer,Markus Steinberger,Thomas Köhler

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural Radiance Fields, Gaussian Splatting, compromising visual fidelity, Neural Radiance, Radiance Fields

备注

点击查看摘要

Abstract:Novel view synthesis has evolved rapidly, advancing from Neural Radiance Fields to 3D Gaussian Splatting (3DGS), which offers real-time rendering and rapid training without compromising visual fidelity. However, 3DGS relies heavily on accurate camera poses and high-quality point cloud initialization, which are difficult to obtain in sparse-view scenarios. While traditional Structure from Motion (SfM) pipelines often fail in these settings, existing learning-based point estimation alternatives typically require reliable reference views and remain sensitive to pose or depth errors. In this work, we propose a robust method utilizing {\pi}^3, a reference-free point cloud estimation network. We integrate dense initialization from {\pi}^3 with a regularization scheme designed to mitigate geometric inaccuracies. Specifically, we employ uncertainty-guided depth supervision, normal consistency loss, and depth warping. Experimental results demonstrate that our approach achieves state-of-the-art performance on the Tanks and Temples, LLFF, DTU, and MipNeRF360 datasets.

64. 【2602.03320】MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning

链接https://arxiv.org/abs/2602.03320

作者:Shengyuan Liu,Liuxin Bao,Qi Yang,Wanting Geng,Boyun Zheng,Chenxin Li,Wenting Chen,Houwen Peng,Yixuan Yuan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multi-modal Large Language, Large Language Models, leverages Multi-modal Large, evolving from task-specific, Medical image segmentation

备注: 23 Pages, 4 Figures

点击查看摘要

Abstract:Medical image segmentation is evolving from task-specific models toward generalizable frameworks. Recent research leverages Multi-modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single-turn, rigid interaction strategies and lack process-level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available \href{this https URL}{here}.

65. 【2602.03316】Invisible Clean-Label Backdoor Attacks for Generative Data Augmentation

链接https://arxiv.org/abs/2602.03316

作者:Ting Xiang,Jinhui Zhao,Changjian Chen,Zhuo Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generative data augmentation, enrich training images, data augmentation, generative data, image generative models

备注

点击查看摘要

Abstract:With the rapid advancement of image generative models, generative data augmentation has become an effective way to enrich training images, especially when only small-scale datasets are available. At the same time, in practical applications, generative data augmentation can be vulnerable to clean-label backdoor attacks, which aim to bypass human inspection. However, based on theoretical analysis and preliminary experiments, we observe that directly applying existing pixel-level clean-label backdoor attack methods (e.g., COMBAT) to generated images results in low attack success rates. This motivates us to move beyond pixel-level triggers and focus instead on the latent feature level. To this end, we propose InvLBA, an invisible clean-label backdoor attack method for generative data augmentation by latent perturbation. We theoretically prove that the generalization of the clean accuracy and attack success rates of InvLBA can be guaranteed. Experiments on multiple datasets show that our method improves the attack success rate by 46.43% on average, with almost no reduction in clean accuracy and high robustness against SOTA defense methods.

66. 【2602.03314】PQTNet: Pixel-wise Quantitative Thermography Neural Network for Estimating Defect Depth in Polylactic Acid Parts by Additive Manufacturing

链接https://arxiv.org/abs/2602.03314

作者:Lei Deng,Wenhao Huang,Chao Yang,Haoyuan Zheng,Yinbin Tian,Yue Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Thermography Neural Network, Defect depth quantification, Pixel-wise Quantitative Thermography, Quantitative Thermography Neural, additively manufactured

备注: Under review

点击查看摘要

Abstract:Defect depth quantification in additively manufactured (AM) components remains a significant challenge for non-destructive testing (NDT). This study proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) to address this challenge for polylactic acid (PLA) parts. A key innovation is a novel data augmentation strategy that reconstructs thermal sequence data into two-dimensional stripe images, preserving the complete temporal evolution of heat diffusion for each pixel. The PQT-Net architecture incorporates a pre-trained EfficientNetV2-S backbone and a custom Residual Regression Head (RRH) with learnable parameters to refine outputs. Comparative experiments demonstrate the superiority of PQT-Net over other deep learning models, achieving a minimum Mean Absolute Error (MAE) of 0.0094 mm and a coefficient of determination (R) exceeding 99%. The high precision of PQT-Net underscores its potential for robust quantitative defect characterization in AM.

67. 【2602.03310】RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization

链接https://arxiv.org/abs/2602.03310

作者:Songming Liu,Bangguo Li,Kai Ma,Lingxuan Wu,Hengkai Tan,Xiao Ouyang,Hang Su,Jun Zhu

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:architectural inefficiencies, models hold promise, data scarcity, Universal Manipulation Interface, hold promise

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations in diverse families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis. See this https URL for more information.

68. 【2602.03302】Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases

链接https://arxiv.org/abs/2602.03302

作者:Jinze Zhang,Jian Zhong,Li Lin,Jiaxiong Li,Ke Ma,Naiyang Li,Meng Li,Yuan Pan,Zeyu Meng,Mengyun Zhou,Shang Huang,Shilong Yu,Zhengyu Duan,Sutong Li,Honghui Xia,Juping Liu,Dan Liang,Yantao Wei,Xiaoying Tang,Jin Yuan,Peng Xiao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Optical coherence tomography, three-dimensional imaging nature, clinical practices remains, practices remains constrained, conventional single-slice single-task

备注

点击查看摘要

Abstract:Optical coherence tomography (OCT) has revolutionized retinal disease diagnosis with its high-resolution and three-dimensional imaging nature, yet its full diagnostic automation in clinical practices remains constrained by multi-stage workflows and conventional single-slice single-task AI models. We present Full-process OCT-based Clinical Utility System (FOCUS), a foundation model-driven framework enabling end-to-end automation of 3D OCT retinal disease diagnosis. FOCUS sequentially performs image quality assessment with EfficientNetV2-S, followed by abnormality detection and multi-disease classification using a fine-tuned Vision Foundation Model. Crucially, FOCUS leverages a unified adaptive aggregation method to intelligently integrate 2D slices-level predictions into comprehensive 3D patient-level diagnosis. Trained and tested on 3,300 patients (40,672 slices), and externally validated on 1,345 patients (18,498 slices) across four different-tier centers and diverse OCT devices, FOCUS achieved high F1 scores for quality assessment (99.01%), abnormally detection (97.46%), and patient-level diagnosis (94.39%). Real-world validation across centers also showed stable performance (F1: 90.22%-95.24%). In human-machine comparisons, FOCUS matched expert performance in abnormality detection (F1: 95.47% vs 90.91%) and multi-disease diagnosis (F1: 93.49% vs 91.35%), while demonstrating better efficiency. FOCUS automates the image-to-diagnosis pipeline, representing a critical advance towards unmanned ophthalmology with a validated blueprint for autonomous screening to enhance population scale retinal care accessibility and efficiency.

69. 【2602.03300】R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

链接https://arxiv.org/abs/2602.03300

作者:Jingyi Zhang,Tianyi Lin,Huanjin Yao,Xiang Lan,Shunyu Liu,Jiaxing Huang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Collective Adversarial Data, complex real-world tasks, solving complex real-world, effective data synthesis, data synthesis techniques

备注

点击查看摘要

Abstract:In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

70. 【2602.03295】POP: Prefill-Only Pruning for Efficient Large Model Inference

链接https://arxiv.org/abs/2602.03295

作者:Junhui He,Zhihui Fu,Jun Wang,Qingan Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, remarkable capabilities

备注

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.

71. 【2602.03294】LEVIO: Lightweight Embedded Visual Inertial Odometry for Resource-Constrained Devices

链接https://arxiv.org/abs/2602.03294

作者:Jonas Kühne,Christian Vogt,Michele Magno,Luca Benini

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:infrastructure-less sensor systems, infrastructure-less sensor, augmented reality, essential for mobile, mobile robotics

备注: This article has been accepted for publication in the IEEE Sensors Journal (JSEN)

点击查看摘要

Abstract:Accurate, infrastructure-less sensor systems for motion tracking are essential for mobile robotics and augmented reality (AR) applications. The most popular state-of-the-art visual-inertial odometry (VIO) systems, however, are too computationally demanding for resource-constrained hardware, such as micro-drones and smart glasses. This work presents LEVIO, a fully featured VIO pipeline optimized for ultra-low-power compute platforms, allowing six-degrees-of-freedom (DoF) real-time sensing. LEVIO incorporates established VIO components such as Oriented FAST and Rotated BRIEF (ORB) feature tracking and bundle adjustment, while emphasizing a computationally efficient architecture with parallelization and low memory usage to suit embedded microcontrollers and low-power systems-on-chip (SoCs). The paper proposes and details the algorithmic design choices and the hardware-software co-optimization approach, and presents real-time performance on resource-constrained hardware. LEVIO is validated on a parallel-processing ultra-low-power RISC-V SoC, achieving 20 FPS while consuming less than 100 mW, and benchmarked against public VIO datasets, offering a compelling balance between efficiency and accuracy. To facilitate reproducibility and adoption, the complete implementation is released as open-source.

72. 【2602.03292】A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation

链接https://arxiv.org/abs/2602.03292

作者:Jianghao Wu,Xiangde Luo,Yubo Zhou,Lianming Wu,Guotai Wang,Shaoting Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:accessing source data, offers a practical, data or retraining, practical solution, solution for deploying

备注: Accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose \textbf{A3-TTA}, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at this https URL.

73. 【2602.03284】me Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks

链接https://arxiv.org/abs/2602.03284

作者:Yi Yu,Qixin Zhang,Shuhan Ye,Xun Lin,Qianshan Wei,Kun Wang,Wenhan Yang,Dacheng Tao,Xudong Jiang

类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Spiking neural networks, Spiking neural, exploit temporal structure, attacks change intensities, neural networks

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure, yet most adversarial attacks change intensities or event counts instead of timing. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs, thus remaining rate-preserving. We formalize a capacity-1 spike-retiming threat model with a unified trio of budgets: per-spike jitter $\mathcal{B}_{\infty}$, total delay $\mathcal{B}_{1}$, and tamper count $\mathcal{B}_{0}$. Feasible adversarial examples must satisfy timeline consistency and non-overlap, which makes the search space discrete and constrained. To optimize such retimings at scale, we use projected-in-the-loop (PIL) optimization: shift-probability logits yield a differentiable soft retiming for backpropagation, and a strict projection in the forward pass produces a feasible discrete schedule that satisfies capacity-1, non-overlap, and the chosen budget at every step. The objective maximizes task loss on the projected input and adds a capacity regularizer together with budget-aware penalties, which stabilizes gradients and aligns optimization with evaluation. Across event-driven benchmarks (CIFAR10-DVS, DVS-Gesture, N-MNIST) and diverse SNN architectures, we evaluate under binary and integer event grids and a range of retiming budgets, and also test models trained with timing-aware adversarial training designed to counter timing-only attacks. For example, on DVS-Gesture the attack attains high success (over $90\%$) while touching fewer than $2\%$ of spikes under $\mathcal{B}_{0}$. Taken together, our results show that spike retiming is a practical and stealthy attack surface that current defenses struggle to counter, providing a clear reference for temporal robustness in event-driven SNNs. Code is available at this https URL.

74. 【2602.03282】Global Geometry Is Not Enough for Vision Representations

链接https://arxiv.org/abs/2602.03282

作者:Jiwan Chung,Seon Joo Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:globally well-distributed embeddings, well-distributed embeddings support, embeddings support robust, representation learning, generalizable representations

备注

点击查看摘要

Abstract:A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across 21 vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

75. 【2602.03264】HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis

链接https://arxiv.org/abs/2602.03264

作者:Francesco Di Salvo,Sebastian Doerrich,Jonas Alle,Christian Ledig

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:deep neural networks, training distributions remains, neural networks, training distributions, distributions remains

备注: Accepted to Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT models. We further propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of $+2.1\%$ AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions. The code is available at this https URL .

76. 【2602.03253】LaVPR: Benchmarking Language and Vision for Place Recognition

链接https://arxiv.org/abs/2602.03253

作者:Ofer Idan,Dan Badur,Yosi Keller,Yoli Shavit

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Place Recognition, Visual Place, Place Recognition, perceptual aliasing, fails under extreme

备注

点击查看摘要

Abstract:Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at this https URL.

77. 【2602.03242】InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

链接https://arxiv.org/abs/2602.03242

作者:Zhuoran Yang,Xi Guo,Chenjing Ding,Chiyu Wang,Wei Wu,Yanyong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large-scale multi-view driving, robust models trained, trained on high-quality, large-scale multi-view, relies on robust

备注

点击查看摘要

Abstract:Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is this https URL.

78. 【2602.03230】EventFlash: Towards Efficient MLLMs for Event-Based Vision

链接https://arxiv.org/abs/2602.03230

作者:Shaoyu Liu,Jianing Li,Guanghui Zhao,Yunjian Zhang,Wen Jiang,Ming Li,Xiangyang Ji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enable robust perception, multimodal large language, Event-based multimodal large, addressing key limitations, large language models

备注

点击查看摘要

Abstract:Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.

79. 【2602.03227】Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane

链接https://arxiv.org/abs/2602.03227

作者:Haoyu Liu,Sucheng Ren,Tingyu Zhu,Peng Wang,Cihang Xie,Alan Yuille,Zeyu Zheng,Feng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Rotary Position Embedding, support length extrapolation, large language models, language models due, encode relative positions

备注

点击查看摘要

Abstract:Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes two-dimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions. We identify this directional constraint as a fundamental limitation of the standard axial 2D RoPE, which hinders the modeling of oblique spatial relationships that naturally exist in natural images. To overcome this limitation, we propose Spiral RoPE, a simple yet effective extension that enables multi-directional positional encoding by partitioning embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of the patch position onto its corresponding direction, allowing spatial relationships to be encoded beyond the horizontal and vertical axes. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance. Qualitative analysis of attention maps further show that Spiral RoPE exhibits more concentrated activations on semantically relevant objects and better respects local object boundaries, highlighting the importance of multi-directional positional encoding in vision transformers.

80. 【2602.03220】PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation

链接https://arxiv.org/abs/2602.03220

作者:Jingbang Tang(James)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-quality synthesis requires, fine-grained style expression, stable character structure, paper studies reference-free, studies reference-free style-conditioned

备注: 7 pages, 5 figures. Under review at IJCNN 2026

点击查看摘要

Abstract:This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches primarily rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment this http URL propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully this http URL Attention trains only decoder cross-attention layers together with a compact style projection module, resulting in a parameter-efficient and plug-and-play control component that can be easily integrated into existing diffusion pipelines and transferred across different this http URL on a stylized character generation benchmark (Pokemon-style) demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.

81. 【2602.03214】FARTrack: Fast Autoregressive Visual Tracking with High Performance

链接https://arxiv.org/abs/2602.03214

作者:Guijie Wang,Tong Lin,Yifan Bai,Anjia Cao,Shiyi Liang,Wangbo Zhao,Xing Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:critical evaluation metrics, Inter-frame Autoregressive Sparsification, critical evaluation, evaluation metrics, field of visual

备注

点击查看摘要

Abstract:Inference speed and tracking performance are two critical evaluation metrics in the field of visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose FARTrack, a Fast Auto-Regressive Tracking framework. Since autoregression emphasizes the temporal nature of the trajectory sequence, it can maintain high performance while achieving efficient execution across various devices. FARTrack introduces Task-Specific Self-Distillation and Inter-frame Autoregressive Sparsification, designed from the perspectives of shallow-yet-accurate distillation and redundant-to-essential token optimization, respectively. Task-Specific Self-Distillation achieves model compression by distilling task-specific tokens layer by layer, enhancing the model's inference speed while avoiding suboptimal manual teacher-student layer pairs assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally-global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance. It delivers an AO of 70.6% on GOT-10k in real-time. Beyond, our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU.

82. 【2602.03213】ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

链接https://arxiv.org/abs/2602.03213

作者:Zhuoran Yang,Yanyong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-quality multi-view driving, robust models trained, trained on large-scale, high-quality multi-view, Autonomous driving relies

备注

点击查看摘要

Abstract:Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is this https URL.

83. 【2602.03210】VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

链接https://arxiv.org/abs/2602.03210

作者:Zhiwen Li,Zhongjie Duan,Jinyan Ye,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Replicating In-Context Learning, computer vision remains, vision remains challenging, remains challenging due, Replicating In-Context

备注

点击查看摘要

Abstract:Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose \textbf{VIRAL}, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at this https URL

84. 【2602.03208】Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation

链接https://arxiv.org/abs/2602.03208

作者:Jinyan Ye,Zhongjie Duan,Zhiwen Li,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Inference-time scaling offers, aligning visual generative, Inference-time scaling, parameter updates, visual generative models

备注

点击查看摘要

Abstract:Inference-time scaling offers a versatile paradigm for aligning visual generative models with downstream objectives without parameter updates. However, existing approaches that optimize the high-dimensional initial noise suffer from severe inefficiency, as many search directions exert negligible influence on the final generation. We show that this inefficiency is closely related to a spectral bias in generative dynamics: model sensitivity to initial perturbations diminishes rapidly as frequency increases. Building on this insight, we propose Spectral Evolution Search (SES), a plug-and-play framework for initial noise optimization that executes gradient-free evolutionary search within a low-frequency subspace. Theoretically, we derive the Spectral Scaling Prediction from perturbation propagation dynamics, which explains the systematic differences in the impact of perturbations across frequencies. Extensive experiments demonstrate that SES significantly advances the Pareto frontier of generation quality versus computational cost, consistently outperforming strong baselines under equivalent budgets.

85. 【2602.03207】WebSplatter: Enabling Cross-Device Efficient Gaussian Splatting in Web Browsers via WebGPU

链接https://arxiv.org/abs/2602.03207

作者:Yudong Han,Chao Xu,Xiaodan Ye,Weichen Bi,Zilong Dong,Yun Ma

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)

关键词:GPU rendering pipeline, GPU rendering, heterogeneous web ecosystem, rendering pipeline, GPU

备注

点击查看摘要

Abstract:We present WebSplatter, an end-to-end GPU rendering pipeline for the heterogeneous web ecosystem. Unlike naive ports, WebSplatter introduces a wait-free hierarchical radix sort that circumvents the lack of global atomics in WebGPU, ensuring deterministic execution across diverse hardware. Furthermore, we propose an opacity-aware geometry culling stage that dynamically prunes splats before rasterization, significantly reducing overdraw and peak memory footprint. Evaluation demonstrates that WebSplatter consistently achieves 1.2$\times$ to 4.5$\times$ speedups over state-of-the-art web viewers.

86. 【2602.03200】Hand3R: Online 4D Hand-Scene Reconstruction in the Wild

链接https://arxiv.org/abs/2602.03200

作者:Wendi Hu,Haonan Zhou,Wenhao Hu,Gaoang Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:jointly reconstructing dynamic, understanding physical interaction, reconstructing dynamic hands, jointly reconstructing, physical interaction

备注

点击查看摘要

Abstract:For Embodied AI, jointly reconstructing dynamic hands and the dense scene context is crucial for understanding physical interaction. However, most existing methods recover isolated hands in local coordinates, overlooking the surrounding 3D environment. To address this, we present Hand3R, the first online framework for joint 4D hand-scene reconstruction from monocular video. Hand3R synergizes a pre-trained hand expert with a 4D scene foundation model via a scene-aware visual prompting mechanism. By injecting high-fidelity hand priors into a persistent scene memory, our approach enables simultaneous reconstruction of accurate hand meshes and dense metric-scale scene geometry in a single forward pass. Experiments demonstrate that Hand3R bypasses the reliance on offline optimization and delivers competitive performance in both local hand reconstruction and global positioning.

87. 【2602.03198】From Single Scan to Sequential Consistency: A New Paradigm for LIDAR Relocalization

链接https://arxiv.org/abs/2602.03198

作者:Minghang Zhu,Zhijing Wang,Yuxin Guo,Wen Li,Sheng Ao,Cheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:LiDAR relocalization aims, LiDAR relocalization, Global Coordinate Estimation, Prior Coordinate Generation, Coordinate Estimation module

备注: Nothing

点击查看摘要

Abstract:LiDAR relocalization aims to estimate the global 6-DoF pose of a sensor in the environment. However, existing regression-based approaches are prone to dynamic or ambiguous scenarios, as they either solely rely on single-frame inference or neglect the spatio-temporal consistency across scans. In this paper, we propose TempLoc, a new LiDAR relocalization framework that enhances the robustness of localization by effectively modeling sequential consistency. Specifically, a Global Coordinate Estimation module is first introduced to predict point-wise global coordinates and associated uncertainties for each LiDAR scan. A Prior Coordinate Generation module is then presented to estimate inter-frame point correspondences by the attention mechanism. Lastly, an Uncertainty-Guided Coordinate Fusion module is deployed to integrate both predictions of point correspondence in an end-to-end fashion, yielding a more temporally consistent and accurate global 6-DoF pose. Experimental results on the NCLT and Oxford Robot-Car benchmarks show that our TempLoc outperforms stateof-the-art methods by a large margin, demonstrating the effectiveness of temporal-aware correspondence modeling in LiDAR relocalization. Our code will be released soon.

88. 【2602.03182】LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution

链接https://arxiv.org/abs/2602.03182

作者:Tianxing Wu,Zheng Chen,Cirou Xu,Bowen Chai,Yong Guo,Yutong Liu,Linghe Kong,Yulun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated promising capability, Diffusion Transformers, One-Step Diffusion Models, demonstrated promising, promising capability

备注: Code is available at: [this https URL](https://github.com/zhengchen1999/LSGQuant)

点击查看摘要

Abstract:One-Step Diffusion Models have demonstrated promising capability and fast inference in video super-resolution (VSR) for real-world. Nevertheless, the substantial model size and high computational cost of Diffusion Transformers (DiTs) limit downstream applications. While low-bit quantization is a common approach for model compression, the effectiveness of quantized models is challenged by the high dynamic range of input latent and diverse layer behaviors. To deal with these challenges, we introduce LSGQuant, a layer-sensitivity guided quantizing approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Furthermore, we estimate layer sensitivity and implement a Variance-Oriented Layer Training Strategy (VOLTS) by analyzing layer-wise statistics in calibration. We also introduce Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch. Extensive experiments demonstrate that our method has nearly performance to origin model with full-precision and significantly exceeds existing quantization techniques. Code is available at: this https URL.

89. 【2602.03176】BinaryDemoire: Moiré-Aware Binarization for Image Demoiréing

链接https://arxiv.org/abs/2602.03176

作者:Zheng Chen,Zhi Yang,Xiaoyang Liu,Weihang Zhang,Mengfan Wang,Yifan Fu,Linghe Kong,Yulun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Image demoiréing aims, recaptured imagery, scales and directions, Image demoiréing, aims to remove

备注: Code is available at: [this https URL](https://github.com/zhengchen1999/BinaryDemoire)

点击查看摘要

Abstract:Image demoiréing aims to remove structured moiré artifacts in recaptured imagery, where degradations are highly frequency-dependent and vary across scales and directions. While recent deep networks achieve high-quality restoration, their full-precision designs remain costly for deployment. Binarization offers an extreme compression regime by quantizing both activations and weights to 1-bit. Yet, it has been rarely studied for demoiréing and performs poorly when naively applied. In this work, we propose BinaryDemoire, a binarized demoiréing framework that explicitly accommodates the frequency structure of moiré degradations. First, we introduce a moiré-aware binary gate (MABG) that extracts lightweight frequency descriptors together with activation statistics. It predicts channel-wise gating coefficients to condition the aggregation of binary convolution responses. Second, we design a shuffle-grouped residual adapter (SGRA) that performs structured sparse shortcut alignment. It further integrates interleaved mixing to promote information exchange across different channel partitions. Extensive experiments on four benchmarks demonstrate that the proposed BinaryDemoire surpasses current binarization methods. Code: this https URL.

90. 【2602.03157】Human-in-the-loop Adaptation in Group Activity Feature Learning for Team Sports Video Retrieval

链接https://arxiv.org/abs/2602.03157

作者:Chihiro Nakatani,Hiroaki Kawashima,Norimichi Ukita

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Group Activity Feature, Activity Feature Learning, group activity annotations, Activity Feature, Group Activity

备注: Accepted to Computer Vision and Image Understanding (CVIU)

点击查看摘要

Abstract:This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: this https URL.

91. 【2602.03156】Fully Kolmogorov-Arnold Deep Model in Medical Image Segmentation

链接https://arxiv.org/abs/2602.03156

作者:Xingyu Qiu,Xinghua Ma,Dong Liang,Gongning Luo,Wei Wang,Kuanquan Wang,Shuo Li

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:practically impossible due, substantial memory requirements, high training difficulties, practically impossible, difficulties and substantial

备注: 11 pages, 5 figures, conference

点击查看摘要

Abstract:Deeply stacked KANs are practically impossible due to high training difficulties and substantial memory requirements. Consequently, existing studies can only incorporate few KAN layers, hindering the comprehensive exploration of KANs. This study overcomes these limitations and introduces the first fully KA-based deep model, demonstrating that KA-based layers can entirely replace traditional architectures in deep learning and achieve superior learning capacity. Specifically, (1) the proposed Share-activation KAN (SaKAN) reformulates Sprecher's variant of Kolmogorov-Arnold representation theorem, which achieves better optimization due to its simplified parameterization and denser training samples, to ease training difficulty, (2) this paper indicates that spline gradients contribute negligibly to training while consuming huge GPU memory, thus proposes the Grad-Free Spline to significantly reduce memory usage and computational overhead. (3) Building on these two innovations, our ALL U-KAN is the first representative implementation of fully KA-based deep model, where the proposed KA and KAonv layers completely replace FC and Conv layers. Extensive evaluations on three medical image segmentation tasks confirm the superiority of the full KA-based architecture compared to partial KA-based and traditional architectures, achieving all higher segmentation accuracy. Compared to directly deeply stacked KAN, ALL U-KAN achieves 10 times reduction in parameter count and reduces memory consumption by more than 20 times, unlocking the new explorations into deep KAN architectures.

92. 【2602.03139】Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

链接https://arxiv.org/abs/2602.03139

作者:Tianhe Wu,Ruibin Li,Lei Zhang,Kede Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Distribution matching distillation, low inference cost, enable high-quality generation, Distribution matching, aligns a multi-step

备注

点击查看摘要

Abstract:Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity -- no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images -- preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.

93. 【2602.03137】FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion

链接https://arxiv.org/abs/2602.03137

作者:Chen-Bin Feng,Youyang Sha,Longfei Liu,Yongjun Yu,Chi Man Vong,Xuanlong Yu,Xi Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Foundation Models, Few-Shot Object Detectors, leverages vision foundation, Vision Foundation, Detectors with Vision

备注: Accepted by ICLR 2026. Code is available at: \url{ [this https URL](https://intellindust-ai-lab.github.io/projects/FSOD-VFM) }

点击查看摘要

Abstract:In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: this https URL.

94. 【2602.03134】SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

链接https://arxiv.org/abs/2602.03134

作者:Chen Qian,Xinran Yu,Danyang Li,Guoxuan Chi,Zheng Yang,Qiang Ma,Xin Miao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:vision-language models, improve efficiency, Visual token, promising approach, approach for reducing

备注

点击查看摘要

Abstract:Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.

95. 【2602.03130】FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

链接https://arxiv.org/abs/2602.03130

作者:Chenxi Zhang,Ziliang Gan,Liyun Zhu,Youwei Pang,Qing Zhang,Rongjunchen Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)

关键词:domain poses substantial, poses substantial challenges, financial domain poses, knowledge-intensive reasoning requirements, specialized chart formats

备注

点击查看摘要

Abstract:The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11{,}133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveal their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.

96. 【2602.03126】Flexible Geometric Guidance for Probabilistic Human Pose Estimation with Diffusion Models

链接https://arxiv.org/abs/2602.03126

作者:Francis Snelgar,Ming Xu,Stephen Gould,Liang Zheng,Akshay Asthana

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging problem due, ambiguity and occlusion, challenging problem, problem due, due to depth

备注

点击查看摘要

Abstract:3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. Because of these challenges the task is underdetermined, where there exists multiple -- possibly infinite -- poses that are plausible given the image. Despite this, many prior works assume the existence of a deterministic mapping and estimate a single pose given an image. Furthermore, methods based on machine learning require a large amount of paired 2D-3D data to train and suffer from generalization issues to unseen scenarios. To address both of these issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses which are consistent with a 2D image. Our approach falls under the guidance framework for conditional generation, and guides samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps from a 2D keypoint detector. We evaluate our method on the Human 3.6M dataset under best-of-$m$ multiple hypothesis evaluation, showing state-of-the-art performance among methods which do not require paired 2D-3D data for training. We additionally evaluate the generalization ability using the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework by using it for novel tasks including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at this https URL .

97. 【2602.03124】Feature, Alignment, and Supervision in Category Learning: A Comparative Approach with Children and Neural Networks

链接https://arxiv.org/abs/2602.03124

作者:Fanxiao Wani Qiu,Oscar Leong

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Understanding how humans, humans and machines, machines learn, science and machine, learn from sparse

备注

点击查看摘要

Abstract:Understanding how humans and machines learn from sparse data is central to cognitive science and machine learning. Using a species-fair design, we compare children and convolutional neural networks (CNNs) in a few-shot semi-supervised category learning task. Both learners are exposed to novel object categories under identical conditions. Learners receive mixtures of labeled and unlabeled exemplars while we vary supervision (1/3/6 labels), target feature (size, shape, pattern), and perceptual alignment (high/low). We find that children generalize rapidly from minimal labels but show strong feature-specific biases and sensitivity to alignment. CNNs show a different interaction profile: added supervision improves performance, but both alignment and feature structure moderate the impact additional supervision has on learning. These results show that human-model comparisons must be drawn under the right conditions, emphasizing interactions among supervision, feature structure, and alignment rather than overall accuracy.

98. 【2602.03123】Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models

链接https://arxiv.org/abs/2602.03123

作者:Judah Goldfeder,Shreyes Kaliyur,Vaibhav Sourirajan,Patrick Minwan Puma,Philippe Martin Wyder,Yuhang Hu,Jiong Lin,Hod Lipson

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:cornerstone for reducing, reducing overfitting, overfitting in vision, AutoAugment automating, automating the design

备注

点击查看摘要

Abstract:Data augmentation has long been a cornerstone for reducing overfitting in vision models, with methods like AutoAugment automating the design of task-specific augmentations. Recent advances in generative models, such as conditional diffusion and few-shot NeRFs, offer a new paradigm for data augmentation by synthesizing data with significantly greater diversity and realism. However, unlike traditional augmentations like cropping or rotation, these methods introduce substantial changes that enhance robustness but also risk degrading performance if the augmentations are poorly matched to the task. In this work, we present EvoAug, an automated augmentation learning pipeline, which leverages these generative models alongside an efficient evolutionary algorithm to learn optimal task-specific augmentations. Our pipeline introduces a novel approach to image augmentation that learns stochastic augmentation trees that hierarchically compose augmentations, enabling more structured and adaptive transformations. We demonstrate strong performance across fine-grained classification and few-shot learning tasks. Notably, our pipeline discovers augmentations that align with domain knowledge, even in low-data settings. These results highlight the potential of learned generative augmentations, unlocking new possibilities for robust model training.

99. 【2602.03105】Gromov Wasserstein Optimal Transport for Semantic Correspondences

链接https://arxiv.org/abs/2602.03105

作者:Francis Snelgar,Stephen Gould,Ming Xu,Liang Zheng,Akshay Asthana

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:long studied problem, Establishing correspondences, computer vision, image pairs, long studied

备注

点击查看摘要

Abstract:Establishing correspondences between image pairs is a long studied problem in computer vision. With recent large-scale foundation models showing strong zero-shot performance on downstream tasks including classification and segmentation, there has been interest in using the internal feature maps of these models for the semantic correspondence task. Recent works observe that features from DINOv2 and Stable Diffusion (SD) are complementary, the former producing accurate but sparse correspondences, while the latter produces spatially consistent correspondences. As a result, current state-of-the-art methods for semantic correspondence involve combining features from both models in an ensemble. While the performance of these methods is impressive, they are computationally expensive, requiring evaluating feature maps from large-scale foundation models. In this work we take a different approach, instead replacing SD features with a superior matching algorithm which is imbued with the desirable spatial consistency property. Specifically, we replace the standard nearest neighbours matching with an optimal transport algorithm that includes a Gromov Wasserstein spatial smoothness prior. We show that we can significantly boost the performance of the DINOv2 baseline, and be competitive and sometimes surpassing state-of-the-art methods using Stable Diffusion features, while being 5--10x more efficient. We make code available at this https URL .

100. 【2602.03086】Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning

链接https://arxiv.org/abs/2602.03086

作者:Jiayao Mai,Bangyan Liao,Zhenjun Zhao,Yingping Zeng,Haoang Li,Javier Civera,Tailin Wu,Yi Zhou,Peidong Liu

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:global optimization, solving challenging problems, robust optimization, polynomial root-finding, principle for solving

备注

点击查看摘要

Abstract:The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.

101. 【2602.03076】A generalizable large-scale foundation model for musculoskeletal radiographs

链接https://arxiv.org/abs/2602.03076

作者:Shinn Kim,Soobin Lee,Kyoungseob Shin,Han-Soo Kim,Yongsung Kim,Minsu Kim,Juhong Nam,Somang Ko,Daeheon Kwon,Wook Huh,Ilkyu Han,Sunghoon Kwon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Artificial intelligence, characterizing musculoskeletal diseases, shown promise, promise in detecting, detecting and characterizing

备注

点击查看摘要

Abstract:Artificial intelligence (AI) has shown promise in detecting and characterizing musculoskeletal diseases from radiographs. However, most existing models remain task-specific, annotation-dependent, and limited in generalizability across diseases and anatomical regions. Although a generalizable foundation model trained on large-scale musculoskeletal radiographs is clinically needed, publicly available datasets remain limited in size and lack sufficient diversity to enable training across a wide range of musculoskeletal conditions and anatomical sites. Here, we present SKELEX, a large-scale foundation model for musculoskeletal radiographs, trained using self-supervised learning on 1.2 million diverse, condition-rich images. The model was evaluated on 12 downstream diagnostic tasks and generally outperformed baselines in fracture detection, osteoarthritis grading, and bone tumor classification. Furthermore, SKELEX demonstrated zero-shot abnormality localization, producing error maps that identified pathologic regions without task-specific training. Building on this capability, we developed an interpretable, region-guided model for predicting bone tumors, which maintained robust performance on independent external datasets and was deployed as a publicly accessible web application. Overall, SKELEX provides a scalable, label-efficient, and generalizable AI framework for musculoskeletal imaging, establishing a foundation for both clinical translation and data-efficient research in musculoskeletal radiology.

102. 【2602.03071】Finding Optimal Video Moment without Training: Gaussian Boundary Optimization for Weakly Supervised Video Grounding

链接https://arxiv.org/abs/2602.03071

作者:Sunoh Kim,Kimin Yun,Daeho Um

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Weakly supervised temporal, video grounding aims, Weakly supervised, supervised temporal video, temporal video grounding

备注: Accepted in IEEE TMM

点击查看摘要

Abstract:Weakly supervised temporal video grounding aims to localize query-relevant segments in untrimmed videos using only video-sentence pairs, without requiring ground-truth segment annotations that specify exact temporal boundaries. Recent approaches tackle this task by utilizing Gaussian-based temporal proposals to represent query-relevant segments. However, their inference strategies rely on heuristic mappings from Gaussian parameters to segment boundaries, resulting in suboptimal localization performance. To address this issue, we propose Gaussian Boundary Optimization (GBO), a novel inference framework that predicts segment boundaries by solving a principled optimization problem that balances proposal coverage and segment compactness. We derive a closed-form solution for this problem and rigorously analyze the optimality conditions under varying penalty regimes. Beyond its theoretical foundations, GBO offers several practical advantages: it is training-free and compatible with both single-Gaussian and mixture-based proposal architectures. Our experiments show that GBO significantly improves localization, achieving state-of-the-art results across standard benchmarks. Extensive experiments demonstrate the efficiency and generalizability of GBO across various proposal schemes. The code is available at \href{this https URL}{this https URL}.

103. 【2602.03064】JRDB-Pose3D: A Multi-person 3D Human Pose and Shape Estimation Dataset for Robotics

链接https://arxiv.org/abs/2602.03064

作者:Sandika Biswas,Kian Izadpanah,Hamid Rezatofighi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:inherently crowded, human pose, pose, human, human pose annotations

备注

点击查看摘要

Abstract:Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments. Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.

104. 【2602.03060】IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

链接https://arxiv.org/abs/2602.03060

作者:Zhichao Sun,Yidong Ma,Gang Liu,Yibo Chen,Xu Tang,Yao Hu,Yongchao Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, achieve impressive performance, achieve impressive

备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into \emph{how LVLMs process spatial reasoning}. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as \textbf{implicit visual coordinates} (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose \textbf{IVC-Prune}, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate identity matrix or the $90^\circ$ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50\% while maintaining $\geq$ 99\% of the original performance and even achieving improvements on several benchmarks. Source codes are available at this https URL.

105. 【2602.03043】SAFE-KD: Risk-Controlled Early-Exit Distillation for Vision Backbones

链接https://arxiv.org/abs/2602.03043

作者:Salim Khazem

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Early-exit networks reduce, practical deployment hinges, Early-exit networks, networks reduce inference, reduce inference cost

备注: Submitted to IJCNN

点击查看摘要

Abstract:Early-exit networks reduce inference cost by allowing ``easy'' inputs to stop early, but practical deployment hinges on knowing \emph{when} early exit is safe. We introduce SAFE-KD, a universal multi-exit wrapper for modern vision backbones that couples hierarchical distillation with \emph{conformal risk control}. SAFE-KD attaches lightweight exit heads at intermediate depths, distills a strong teacher into all exits via Decoupled Knowledge Distillation (DKD), and enforces deep-to-shallow consistency between exits. At inference, we calibrate per-exit stopping thresholds on a held-out set using conformal risk control (CRC) to guarantee a user-specified \emph{selective} misclassification risk (among the samples that exit early) under exchangeability. Across multiple datasets and architectures, SAFE-KD yields improved accuracy compute trade-offs, stronger calibration, and robust performance under corruption while providing finite-sample risk guarantees.

106. 【2602.03039】HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency

链接https://arxiv.org/abs/2602.03039

作者:Geonhui Son,Jeong Ryong Lee,Dosik Hwang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generative Adversarial Networks, Generative Adversarial, Adversarial Networks, made significant progress, progress in enhancing

备注: Accepted manuscript. This is the accepted version of the article published in Neural Networks

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies this through the generated images to train the generator, thereby enabling the generation of more diverse and high quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets-including scenarios with large, small, and limited data, and covering a variety of image domains-demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: this https URL.

107. 【2602.03038】Bongards at the Boundary of Perception and Reasoning: Programs or Language?

链接https://arxiv.org/abs/2602.03038

作者:Cassidy Langenfeld,Claas Beger,Gloria Geng,Wasu Top Piriyakulkij,Keya Hu,Yewen Pu,Kevin Ellis

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:made great strides, answering commonsense questions, everyday visual tasks, Vision-Language Models, made great

备注: 6 pages, 5 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have made great strides in everyday visual tasks, such as captioning a natural image, or answering commonsense questions about such images. But humans possess the puzzling ability to deploy their visual reasoning abilities in radically new situations, a skill rigorously tested by the classic set of visual reasoning challenges known as the Bongard problems. We present a neurosymbolic approach to solving these problems: given a hypothesized solution rule for a Bongard problem, we leverage LLMs to generate parameterized programmatic representations for the rule and perform parameter fitting using Bayesian optimization. We evaluate our method on classifying Bongard problem images given the ground truth rule, as well as on solving the problems from scratch.

108. 【2602.03028】MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration

链接https://arxiv.org/abs/2602.03028

作者:Wenzhang Sun,Zhenyu Wang,Zhangchi Hu,Chunfeng Wang,Hao Li,Wei Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generating long-form audio-visual, long-form audio-visual stories, short user prompt, user prompt remains, prompt remains challenging

备注

点击查看摘要

Abstract:Generating long-form audio-visual stories from a short user prompt remains challenging due to an intent-execution gap, where high-level narrative intent must be preserved across coherent, shot-level multimodal generation over long horizons. Existing approaches typically rely on feed-forward pipelines or prompt-only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed-loop constraint enforcement problem and propose MUSE, a multi-agent framework that coordinates generation through an iterative plan-execute-verify-revise loop. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open-ended storytelling without ground-truth references, we introduce MUSEBench, a reference-free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.

109. 【2602.03015】A Vision-Based Analysis of Congestion Pricing in New York City

链接https://arxiv.org/abs/2602.03015

作者:Mehmet Kerem Turkcan,Jhonatan Tavori,Javad Ghaderi,Gil Zussman,Zoran Kostic,Andrew Smyth

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:York City congestion, City congestion pricing, York City, City congestion, congestion pricing program

备注

点击查看摘要

Abstract:We examine the impact of New York City's congestion pricing program through automated analysis of traffic camera data. Our computer vision pipeline processes footage from over 900 cameras distributed throughout Manhattan and New York, comparing traffic patterns from November 2024 through the program's implementation in January 2025 until January 2026. We establish baseline traffic patterns and identify systematic changes in vehicle density across the monitored region.

110. 【2602.03013】hinking inside the Convolution for Image Inpainting: Reconstructing Texture via Structure under Global and Local Side

链接https://arxiv.org/abs/2602.03013

作者:Haipeng Liu,Yang Wang,Biao Qian,Yong Rui,Meng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional Neural Networks, masked regions semantically, Neural Networks, earned substantial progress, final inpainting output

备注: 17 pages, 17 figures

点击查看摘要

Abstract:Image inpainting has earned substantial progress, owing to the encoder-and-decoder pipeline, which is benefited from the Convolutional Neural Networks (CNNs) with convolutional downsampling to inpaint the masked regions semantically from the known regions within the encoder, coupled with an upsampling process from the decoder for final inpainting output. Recent studies intuitively identify the high-frequency structure and low-frequency texture to be extracted by CNNs from the encoder, and subsequently for a desirable upsampling recovery. However, the existing arts inevitably overlook the information loss for both structure and texture feature maps during the convolutional downsampling process, hence suffer from a non-ideal upsampling output. In this paper, we systematically answer whether and how the structure and texture feature map can mutually help to alleviate the information loss during the convolutional downsampling. Given the structure and texture feature maps, we adopt the statistical normalization and denormalization strategy for the reconstruction guidance during the convolutional downsampling process. The extensive experimental results validate its advantages to the state-of-the-arts over the images from low-to-high resolutions including 256*256 and 512*512, especially holds by substituting all the encoders by ours. Our code is available at this https URL

111. 【2602.03007】VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering

链接https://arxiv.org/abs/2602.03007

作者:Rahul Atul Bhope,K. R. Jayaram,Vinod Muthusamy,Ritesh Kumar,Vatche Isahagian,Nalini Venkatasubramanian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:high-fidelity visual inputs, processing high-fidelity visual, Visual Question Answering, fixed fidelity levels, visual inputs

备注

点击查看摘要

Abstract:Despite significant costs from retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that optimizes what information to retrieve before model execution. Given a query, VOILA uses a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B-235B parameters. VOILA consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints.

112. 【2602.02994】Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

链接https://arxiv.org/abs/2602.02994

作者:Jiaze Li,Hao Yin,Haoran Xu,Boshen Xu,Wenhui Tan,Zewen He,Jianzhong Ju,Zhenbo Luo,Jian Luan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Temporal Video Grounding, Video Grounding, Temporal Video, existing GRPO-based methods, GRPO-based methods remain

备注

点击查看摘要

Abstract:Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.

113. 【2602.02989】SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation

链接https://arxiv.org/abs/2602.02989

作者:Zhanfeng Liao,Jiajun Zhang,Hanzhang Tu,Zhixi Wang,Yunqi Gao,Hongwen Zhang,Yebin Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:immersive visual experiences, achieving photorealistic, reconstruction and immersive, visual experiences, view synthesis

备注

点击查看摘要

Abstract:Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitives' motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan-velocity-aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.

114. 【2602.02977】Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding

链接https://arxiv.org/abs/2602.02977

作者:Byeongju Woo,Zilin Wang,Byeonghyun Pak,Sangwoo Mo,Stella X. Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large vision-language models, CLIP struggle, Large vision-language, vision-language models, Large

备注: Preprint

点击查看摘要

Abstract:Large vision-language models such as CLIP struggle with long captions because they align images and texts as undifferentiated wholes. Fine-grained vision-language understanding requires hierarchical semantics capturing both global context and localized details across visual and textual domains. Yet linguistic hierarchies from syntax or semantics rarely match visual organization, and purely visual hierarchies tend to fragment scenes into appearance-driven parts without semantic focus. We propose CAFT (Cross-domain Alignment of Forests and Trees), a hierarchical image-text representation learning framework that aligns global and local semantics across images and long captions without pixel-level supervision. Coupling a fine-to-coarse visual encoder with a hierarchical text transformer, it uses a hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences, so that coarse semantics are built from fine-grained evidence rather than from aggregation untethered to part-level grounding. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that hierarchical cross-domain alignment enables fine-grained, visually grounded image-text representations to emerge without explicit region-level supervision.

115. 【2602.02974】SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences

链接https://arxiv.org/abs/2602.02974

作者:Seok-Young Kim,Dooyoung Kim,Woojin Cho,Hail Song,Suji Kang,Woontack Woo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:RGB sequences, experience Mixed Reality, introduce SceneLinker, Mixed Reality, generates compositional

备注: Accepted as an IEEE TVCG paper at IEEE VR 2026 (journal track)

点击查看摘要

Abstract:We introduce SceneLinker, a novel framework that generates compositional 3D scenes via semantic scene graph from RGB sequences. To adaptively experience Mixed Reality (MR) content based on each user's space, it is essential to generate a 3D scene that reflects the real-world layout by compactly capturing the semantic cues of the surroundings. Prior works struggled to fully capture the contextual relationship between objects or mainly focused on synthesizing diverse shapes, making it challenging to generate 3D scenes aligned with object arrangements. We address these challenges by designing a graph network with cross-check feature attention for scene graph prediction and constructing a graph-variational autoencoder (graph-VAE), which consists of a joint shape and layout block for 3D scene generation. Experiments on the 3RScan/3DSSG and SG-FRONT datasets demonstrate that our approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial MR content. Project page is this https URL.

116. 【2602.02973】Fisheye Stereo Vision: Depth and Range Error

链接https://arxiv.org/abs/2602.02973

作者:Leaf Jiang,Matthew Holzel,Bernhard Kaplan,Hsiou-Yuan Liu,Sabyasachi Paul,Karen Rankin,Piotr Swierczynski

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:study derives analytical, derives analytical expressions, fisheye stereo vision, stereo vision systems, object distance

备注

点击查看摘要

Abstract:This study derives analytical expressions for the depth and range error of fisheye stereo vision systems as a function of object distance, specifically accounting for accuracy at large angles.

117. 【2602.02969】Dynamic High-frequency Convolution for Infrared Small Target Detection

链接https://arxiv.org/abs/2602.02969

作者:Ruojing Li,Chao Xiao,Qian Yin,Wei An,Nuo Chen,Xinyi Ying,Miao Li,Yingqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Infrared small targets, infrared small target, Infrared small, Single-frame infrared small, locally salient

备注

点击查看摘要

Abstract:Infrared small targets are typically tiny and locally salient, which belong to high-frequency components (HFCs) in images. Single-frame infrared small target (SIRST) detection is challenging, since there are many HFCs along with targets, such as bright corners, broken clouds, and other clutters. Current learning-based methods rely on the powerful capabilities of deep networks, but neglect explicit modeling and discriminative representation learning of various HFCs, which is important to distinguish targets from other HFCs. To address the aforementioned issues, we propose a dynamic high-frequency convolution (DHiF) to translate the discriminative modeling process into the generation of a dynamic local filter bank. Especially, DHiF is sensitive to HFCs, owing to the dynamic parameters of its generated filters being symmetrically adjusted within a zero-centered range according to Fourier transformation properties. Combining with standard convolution operations, DHiF can adaptively and dynamically process different HFC regions and capture their distinctive grayscale variation characteristics for discriminative representation learning. DHiF functions as a drop-in replacement for standard convolution and can be used in arbitrary SIRST detection networks without significant decrease in computational efficiency. To validate the effectiveness of our DHiF, we conducted extensive experiments across different SIRST detection networks on real-scene datasets. Compared to other state-of-the-art convolution operations, DHiF exhibits superior detection performance with promising improvement. Codes are available at this https URL.

118. 【2602.02963】RACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation

链接https://arxiv.org/abs/2602.02963

作者:OFM Riaz Rahman Aranya,Kevin Desai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:treatment response, disease progression, Anatomical Change Explanation, fundamental to clinical, Temporal

备注

点击查看摘要

Abstract:Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.

119. 【2602.02951】Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning

链接https://arxiv.org/abs/2602.02951

作者:Yihong Huang,Fei Ma,Yihua Shao,Jingcai Guo,Zitong Yu,Laizhong Cui,Qi Tian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Vision Language Model, Language Model, effective acceleration technique, efficient Vision Language, Vision Language

备注

点击查看摘要

Abstract:Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) and suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose $\text{Nüwa}$, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that $\text{Nüwa}$ achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).

120. 【2602.02944】SRA-Seg: Synthetic to Real Alignment for Semi-Supervised Medical Image Segmentation

链接https://arxiv.org/abs/2602.02944

作者:OFM Riaz Rahman Aranya,Kevin Desai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:medical image segmentation, extensive expert-annotated data, consistently fails, visual realism, appealing alternative

备注

点击查看摘要

Abstract:Synthetic data, an appealing alternative to extensive expert-annotated data for medical image segmentation, consistently fails to improve segmentation performance despite its visual realism. The reason being that synthetic and real medical images exist in different semantic feature spaces, creating a domain gap that current semi-supervised learning methods cannot bridge. We propose SRA-Seg, a framework explicitly designed to align synthetic and real feature distributions for medical image segmentation. SRA-Seg introduces a similarity-alignment (SA) loss using frozen DINOv2 embeddings to pull synthetic representations toward their nearest real counterparts in semantic space. We employ soft edge blending to create smooth anatomical transitions and continuous labels, eliminating the hard boundaries from traditional copy-paste augmentation. The framework generates pseudo-labels for synthetic images via an EMA teacher model and applies soft-segmentation losses that respect uncertainty in mixed regions. Our experiments demonstrate strong results: using only 10% labeled real data and 90% synthetic unlabeled data, SRA-Seg achieves 89.34% Dice on ACDC and 84.42% on FIVES, significantly outperforming existing semi-supervised methods and matching the performance of methods using real unlabeled data.

121. 【2602.02920】A Reproducible Framework for Bias-Resistant Machine Learning on Small-Sample Neuroimaging Data

链接https://arxiv.org/abs/2602.02920

作者:Jagan Mohan Reddy Dwarampudi,Jennifer L Purks,Joshua Wong,Renjie Hu,Tania Banerjee

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)

关键词:domain-informed feature engineering, small-sample neuroimaging data, integrates domain-informed feature, calibrated decision-threshold optimization, bias-resistant machine learning

备注: Accepted to ISBI 2026, 5 pages with 1 figure

点击查看摘要

Abstract:We introduce a reproducible, bias-resistant machine learning framework that integrates domain-informed feature engineering, nested cross-validation, and calibrated decision-threshold optimization for small-sample neuroimaging data. Conventional cross-validation frameworks that reuse the same folds for both model selection and performance estimation yield optimistically biased results, limiting reproducibility and generalization. Demonstrated on a high-dimensional structural MRI dataset of deep brain stimulation cognitive outcomes, the framework achieved a nested-CV balanced accuracy of 0.660\,$\pm$\,0.068 using a compact, interpretable subset selected via importance-guided ranking. By combining interpretability and unbiased evaluation, this work provides a generalizable computational blueprint for reliable machine learning in data-limited biomedical domains.

122. 【2602.02918】A Multi-scale Linear-time Encoder for Whole-Slide Image Analysis

链接https://arxiv.org/abs/2602.02918

作者:Jagan Mohan Reddy Dwarampudi,Joshua Wong,Hien Van Nguyen,Tania Banerjee

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)

关键词:Adaptive Recurrent Biomedical, Biomedical Linear-time Encoder, Recurrent Biomedical Linear-time, Multi-scale Adaptive Recurrent, Adaptive Recurrent

备注: Accepted to ISBI 2026, 4 pages with 2 figures

点击查看摘要

Abstract:We introduce Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE), the first \textit{purely Mamba-based} multi-state multiple instance learning (MIL) framework for whole-slide image (WSI) analysis. MARBLE processes multiple magnification levels in parallel and integrates coarse-to-fine reasoning within a linear-time state-space model, efficiently capturing cross-scale dependencies with minimal parameter overhead. WSI analysis remains challenging due to gigapixel resolutions and hierarchical magnifications, while existing MIL methods typically operate at a single scale and transformer-based approaches suffer from quadratic attention costs. By coupling parallel multi-scale processing with linear-time sequence modeling, MARBLE provides a scalable and modular alternative to attention-based architectures. Experiments on five public datasets show improvements of up to \textbf{6.9\%} in AUC, \textbf{20.3\%} in accuracy, and \textbf{2.3\%} in C-index, establishing MARBLE as an efficient and generalizable framework for multi-scale WSI analysis.

123. 【2602.02914】FaceLinkGen: Rethinking Identity Leakage in Privacy-Preserving Face Recognition with Identity Extraction

链接https://arxiv.org/abs/2602.02914

作者:Wenqi Guo,Shan Du

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Transformation-based privacy-preserving face, hiding facial data, Transformation-based privacy-preserving, privacy-preserving face recognition, aims to verify

备注

点击查看摘要

Abstract:Transformation-based privacy-preserving face recognition (PPFR) aims to verify identities while hiding facial data from attackers and malicious service providers. Existing evaluations mostly treat privacy as resistance to pixel-level reconstruction, measured by PSNR and SSIM. We show that this reconstruction-centric view fails. We present FaceLinkGen, an identity extraction attack that performs linkage/matching and face regeneration directly from protected templates without recovering original pixels. On three recent PPFR systems, FaceLinkGen reaches over 98.5\% matching accuracy and above 96\% regeneration success, and still exceeds 92\% matching and 94\% regeneration in a near zero knowledge setting. These results expose a structural gap between pixel distortion metrics, which are widely used in PPFR evaluation, and real privacy. We show that visual obfuscation leaves identity information broadly exposed to both external intruders and untrusted service providers.

124. 【2602.02908】A Random Matrix Theory Perspective on the Consistency of Diffusion Models

链接https://arxiv.org/abs/2602.02908

作者:Binxu Wang,Jacob Zavatone-Veth,Cengiz Pehlevan

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:produce strikingly similar, strikingly similar outputs, non-overlapping subsets, produce strikingly, strikingly similar

备注: 65 pages; 53 figures

点击查看摘要

Abstract:Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation $\sigma^2 \mapsto \kappa(\sigma^2)$, explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: \textit{anisotropy} across eigenmodes, \textit{inhomogeneity} across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviates across training data split. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.

125. 【2602.02894】DoubleTake: Contrastive Reasoning for Faithful Decision-Making in Medical Imaging

链接https://arxiv.org/abs/2602.02894

作者:Daivik Patel,Shrenik Patel

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Accurate decision making, existing approaches rely, returns redundant evidence, subtle visual differences, medical imaging requires

备注

点击查看摘要

Abstract:Accurate decision making in medical imaging requires reasoning over subtle visual differences between confusable conditions, yet most existing approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces a single hypothesis. We introduce a contrastive, document-aware reference selection framework that constructs compact evidence sets optimized for discrimination rather than similarity by explicitly balancing visual relevance, embedding diversity, and source-level provenance using ROCO embeddings and metadata. While ROCO provides large-scale image-caption pairs, it does not specify how references should be selected for contrastive reasoning, and naive retrieval frequently yields near-duplicate figures from the same document. To address this gap, we release a reproducible reference selection protocol and curated reference bank that enable a systematic study of contrastive retrieval in medical image reasoning. Building on these contrastive evidence sets, we propose Counterfactual-Contrastive Inference, a confidence-aware reasoning framework that performs structured pairwise visual comparisons and aggregates evidence using margin-based decision rules with faithful abstention. On the MediConfusion benchmark, our approach achieves state-of-the-art performance, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.

126. 【2602.02873】ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying

链接https://arxiv.org/abs/2602.02873

作者:Weihang You,Qingchan Zhu,David Liu,Yi Pan,Geng Yuan,Hanqi Jiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:discards continuous information, due to premature, conversion that discards, spatial layout, excels in language

备注

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum: first distilling frozen experts into model parameters, then learning task-driven querying via sparsity penalties, i.e., ViThinker discovers minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.

127. 【2602.02850】Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room

链接https://arxiv.org/abs/2602.02850

作者:Keqi Chen,Vinkle Srivastav,Armine Vardazaryan,Cindy Rolland,Didier Mutter,Nicolas Padoy

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Operating Room, Privacy preservation, data in Operating, Privacy, Room

备注

点击查看摘要

Abstract:Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations of each new clinical site for high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by "retrieving" false negatives using temporal and multi-view context, and conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low-score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation on each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach achieving over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method's practical applicability. Code is available at this https URL.

128. 【2602.02820】From Tokens to Numbers: Continuous Number Modeling for SVG Generation

链接https://arxiv.org/abs/2602.02820

作者:Michael Ogezi,Martin Bell,Freda Shi,Ethan Smith

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Scalable Vector Graphics, offer clear benefits, vector graphics, image generation tasks, size efficiency

备注

点击查看摘要

Abstract:For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster-based approaches. A core challenge is that the numerical, geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first-class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model's inputs with the data's continuous nature, removing discretization artifacts introduced by token-based encoding. We then train a multimodal transformer on 2 million raster-to-SVG samples, followed by fine-tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high-quality vector generation, with potential for broader applications. We make our code available this http URL.

129. 【2602.02808】LmPT: Conditional Point Transformer for Anatomical Landmark Detection on 3D Point Clouds

链接https://arxiv.org/abs/2602.02808

作者:Matteo Bastico,Pierre Onghena,David Ryckelynck,Beatriz Marcotegui,Santiago Velasco-Forero,Laurent Corté,Caroline Robine--Decourcelle,Etienne Decencière

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Accurate identification, medical applications, Landmark Point Transformer, Accurate, Abstract

备注: This paper has been accepted at International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Accurate identification of anatomical landmarks is crucial for various medical applications. Traditional manual landmarking is time-consuming and prone to inter-observer variability, while rule-based methods are often tailored to specific geometries or limited sets of landmarks. In recent years, anatomical surfaces have been effectively represented as point clouds, which are lightweight structures composed of spatial coordinates. Following this strategy and to overcome the limitations of existing landmarking techniques, we propose Landmark Point Transformer (LmPT), a method for automatic anatomical landmark detection on point clouds that can leverage homologous bones from different species for translational research. The LmPT model incorporates a conditioning mechanism that enables adaptability to different input types to conduct cross-species learning. We focus the evaluation of our approach on femoral landmarking using both human and newly annotated dog femurs, demonstrating its generalization and effectiveness across species. The code and dog femur dataset will be publicly available at: this https URL.

130. 【2602.02765】SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

链接https://arxiv.org/abs/2602.02765

作者:Haruhiko Murata,Kazuhiro Hotta

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Transformers, large-scale foundation models, foundation models, established as large-scale, large-scale foundation

备注: 8 pages, 6 figures

点击查看摘要

Abstract:Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components-\textbf{SPC module}, \textbf{SSVA}, and \textbf{ID-RSVD}-and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.

131. 【2602.02722】Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion

链接https://arxiv.org/abs/2602.02722

作者:Dan Haramati,Carl Qi,Tal Daniel,Amy Zhang,Aviv Tamar,George Konidaris

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Goal-Conditioned Reinforcement Learning, offline Goal-Conditioned Reinforcement, Reinforcement Learning, hierarchical entity-centric framework, combines subgoal decomposition

备注: ICLR 2026

点击查看摘要

Abstract:We propose a hierarchical entity-centric framework for offline Goal-Conditioned Reinforcement Learning (GCRL) that combines subgoal decomposition with factored structure to solve long-horizon tasks in domains with multiple entities. Achieving long-horizon goals in complex environments remains a core challenge in Reinforcement Learning (RL). Domains with multiple entities are particularly difficult due to their combinatorial complexity. GCRL facilitates generalization across goals and the use of subgoal structure, but struggles with high-dimensional observations and combinatorial state-spaces, especially under sparse reward. We employ a two-level hierarchy composed of a value-based GCRL agent and a factored subgoal-generating conditional diffusion model. The RL agent and subgoal generator are trained independently and composed post hoc through selective subgoal generation based on the value function, making the approach modular and compatible with existing GCRL algorithms. We introduce new variations to benchmark tasks that highlight the challenges of multi-entity domains, and show that our method consistently boosts performance of the underlying RL agent on image-based long-horizon tasks with sparse rewards, achieving over 150% higher success rates on the hardest task in our suite and generalizing to increasing horizons and numbers of entities. Rollout videos are provided at: this https URL

132. 【2602.02721】End-to-end reconstruction of OCT optical properties and speckle-reduced structural intensity via physics-based learning

链接https://arxiv.org/abs/2602.02721

作者:Jinglun Yu,Yaning Wang,Wenhan Guo,Yuan Gao,Yu Sun,Jin U. Kang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including refractive index, optical coherence tomography, scattering coefficient, coherence tomography, seeks to recover

备注

点击查看摘要

Abstract:Inverse scattering in optical coherence tomography (OCT) seeks to recover both structural images and intrinsic tissue optical properties, including refractive index, scattering coefficient, and anisotropy. This inverse problem is challenging due to attenuation, speckle noise, and strong coupling among parameters. We propose a regularized end-to-end deep learning framework that jointly reconstructs optical parameter maps and speckle-reduced OCT structural intensity for layer visualization. Trained with Monte Carlo-simulated ground truth, our network incorporates a physics-based OCT forward model that generates predicted signals from the estimated parameters, providing physics-consistent supervision for parameter recovery and artifact suppression. Experiments on the synthetic corneal OCT dataset demonstrate robust optical map recovery under noise, improved resolution, and enhanced structural fidelity. This approach enables quantitative multi-parameter tissue characterization and highlights the benefit of combining physics-informed modeling with deep learning for computational OCT.

133. 【2602.02676】AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

链接https://arxiv.org/abs/2602.02676

作者:Xintong Zhang,Xiaowen Zhang,Jongrong Wu,Zhi Gao,Shilin Yan,Zhenxin Diao,Kunpeng Gao,Xuanyan Chen,Yuwei Wu,Yunde Jia,Qing Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:tool-augmented visual reasoning, promising frontier, frontier in Vision-Language, modulate between tool-augmented, tool-augmented visual

备注

点击查看摘要

Abstract:Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models' capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.

134. 【2602.02571】rajectory Consistency for One-Step Generation on Euler Mean Flows

链接https://arxiv.org/abs/2602.02571

作者:Zhiqi Li,Yuchen Sun,Duowen Chen,Jinjin He,Bo Zhu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Euler Mean Flows, enforces long-range trajectory, minimal sampling cost, long-range trajectory consistency, flow-based generative framework

备注: 40 pages, 27 figures

点击查看摘要

Abstract:We propose \emph{Euler Mean Flows (EMF)}, a flow-based generative framework for one-step and few-step generation that enforces long-range trajectory consistency with minimal sampling cost. The key idea of EMF is to replace the trajectory consistency constraint, which is difficult to supervise and optimize over long time scales, with a principled linear surrogate that enables direct data supervision for long-horizon flow-map compositions. We derive this approximation from the semigroup formulation of flow-based models and show that, under mild regularity assumptions, it faithfully approximates the original consistency objective while being substantially easier to optimize. This formulation leads to a unified, JVP-free training framework that supports both $u$-prediction and $x_1$-prediction variants, avoiding explicit Jacobian computations and significantly reducing memory and computational overhead. Experiments on image synthesis, particle-based geometry generation, and functional generation demonstrate improved optimization stability and sample quality under fixed sampling budgets, together with approximately $50\%$ reductions in training time and memory consumption compared to existing one-step methods for image generation.

135. 【2602.02560】Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

链接https://arxiv.org/abs/2602.02560

作者:Bartlomiej Sobieski,Jakub Grzywaczewski,Karol Dobiczek,Mateusz Wójcik,Tomasz Bartczak,Patryk Szatkowski,Przemysław Bombiński,Matthew Tivnan,Przemyslaw Biecek

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Lung cancer remains, automated screening tools, Lung cancer, alleviate radiologist workload, cancer mortality

备注: Preprint

点击查看摘要

Abstract:Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model's actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

136. 【2602.02559】Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

链接https://arxiv.org/abs/2602.02559

作者:Pengyu Dai,Weihao Xuan,Junjue Wang,Hongruixuan Chen,Jian Song,Yafei Ou,Naoto Yokoya

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:large language model, enabled large language, Recent advances, orchestrating external tools, language model

备注: 21 pages, 6 figures

点击查看摘要

Abstract:Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce \textbf{GeoEvolver}, a self-evolving multi-agent system~(MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12\% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.

137. 【2602.02551】EEO-TFV: Escape-Explore Optimizer for Web-Scale Time-Series Forecasting and Vision Analysis

链接https://arxiv.org/abs/2602.02551

作者:Hua Wang,Jinghao Lu,Fan Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable progress, Transformer-based foundation models, Transformer-based foundation, achieved remarkable, remarkable progress

备注: Main paper: 12 pages

点击查看摘要

Abstract:Transformer-based foundation models have achieved remarkable progress in tasks such as time-series forecasting and image segmentation. However, they frequently suffer from error accumulation in multivariate long-sequence prediction and exhibit vulnerability to out-of-distribution samples in image-related tasks. Furthermore, these challenges become particularly pronounced in large-scale Web data analysis tasks, which typically involve complex temporal patterns and multimodal features. This complexity substantially increases optimization difficulty, rendering models prone to stagnation at saddle points within high-dimensional parameter spaces. To address these issues, we propose a lightweight Transformer architecture in conjunction with a novel Escape-Explore Optimizer (EEO). The optimizer enhances both exploration and generalization while effectively avoiding sharp minima and saddle-point traps. Experimental results show that, in representative Web data scenarios, our method achieves performance on par with state-of-the-art models across 11 time-series benchmark datasets and the Synapse medical image segmentation task. Moreover, it demonstrates superior generalization and stability, thereby validating its potential as a versatile cross-task foundation model for Web-scale data mining and analysis.

138. 【2602.02548】oolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

链接https://arxiv.org/abs/2602.02548

作者:Xiaoce Wang,Guibin Zhang,Junzhe Li,Jinzhe Tu,Chun Li,Ming Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

关键词:Existing GUI agent, varying input resolutions, Existing GUI, one-step visual grounding, visual grounding struggle

备注: 8 pages main paper, 18 pages total, 8 figures, 5 tables, code at [this https URL](https://github.com/ZephinueCode/ToolTok)

点击查看摘要

Abstract:Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training inference code is open-source at this https URL.

139. 【2602.02539】How Much Information Can a Vision Token Hold? A Scaling Law for Recognition Limits in VLMs

链接https://arxiv.org/abs/2602.02539

作者:Shuxin Zhuang,Zi Liang,Runsheng Yu,Hongzong Li,Rong Feng,Shiqin Tang,Youzhi Zhang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent vision-centric approaches, made significant strides, Recent vision-centric, long-context modeling, vision-centric approaches

备注

点击查看摘要

Abstract:Recent vision-centric approaches have made significant strides in long-context modeling. Represented by DeepSeek-OCR, these models encode rendered text into continuous vision tokens, achieving high compression rates without sacrificing recognition precision. However, viewing the vision encoder as a lossy channel with finite representational capacity raises a fundamental question: what is the information upper bound of visual tokens? To investigate this limit, we conduct controlled stress tests by progressively increasing the information quantity (character count) within an image. We observe a distinct phase-transition phenomenon characterized by three regimes: a near-perfect Stable Phase, an Instability Phase marked by increased error variance, and a total Collapse Phase. We analyze the mechanical origins of these transitions and identify key factors. Furthermore, we formulate a probabilistic scaling law that unifies average vision token load and visual density into a latent difficulty metric. Extensive experiments across various Vision-Language Models demonstrate the universality of this scaling law, providing critical empirical guidance for optimizing the efficiency-accuracy trade-off in visual context compression.

140. 【2602.02538】Enhancing Post-Training Quantization via Future Activation Awareness

链接https://arxiv.org/abs/2602.02538

作者:Zheqi Lv,Zhenxuan Fan,Qi Tian,Wenqiao Zhang,Yueting Zhuang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, compress large language, Post-training quantization, language models, compress large

备注

点击查看摘要

Abstract:Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.

141. 【2602.02537】WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

链接https://arxiv.org/abs/2602.02537

作者:Runjie Zhou,Youbo Shao,Haoyu Lu,Bowei Xing,Tongtong Bai,Yujie Chen,Jie Zhao,Lin Sui,Haotian Yao,Zijia Zhao,Hao Yang,Haoning Wu,Zaida Zhou,Jinguo Zhu,Zhiqi Huang,Yiping Bao,Yangyang Liu,Y.Charles,Xinyu Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

备注

点击查看摘要

Abstract:We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.

142. 【2602.02536】From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation

链接https://arxiv.org/abs/2602.02536

作者:Tianle Gu,Kexin Huang,Lingyu Li,Ruilin Luo,Shiyang Huang,Zongqi Wang,Yujiu Yang,Yan Teng,Yingchun Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:identifying harmful content, harmful content, pivotal for identifying, identifying harmful, Safety

备注

点击查看摘要

Abstract:Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi-head scalar reward model (UniRM). UniRM provides multi-dimensional supervision by assigning attribute-level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40\% of the training data used by leading baselines. Ablations further validate our multi-attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \href{this https URL}{project website}.

143. 【2602.02510】Beyond Translation: Cross-Cultural Meme Transcreation with Vision-Language Models

链接https://arxiv.org/abs/2602.02510

作者:Yuming Zhao,Peiyi Zhang,Oana Ignat

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:specificity poses significant, poses significant challenges, cultural specificity poses, online communication, cross-cultural meme transcreation

备注

点击查看摘要

Abstract:Memes are a pervasive form of online communication, yet their cultural specificity poses significant challenges for cross-cultural adaptation. We study cross-cultural meme transcreation, a multimodal generation task that aims to preserve communicative intent and humor while adapting culture-specific references. We propose a hybrid transcreation framework based on vision-language models and introduce a large-scale bidirectional dataset of Chinese and US memes. Using both human judgments and automated evaluation, we analyze 6,315 meme pairs and assess transcreation quality across cultural directions. Our results show that current vision-language models can perform cross-cultural meme transcreation to a limited extent, but exhibit clear directional asymmetries: US-Chinese transcreation consistently achieves higher quality than Chinese-US. We further identify which aspects of humor and visual-textual design transfer across cultures and which remain challenging, and propose an evaluation framework for assessing cross-cultural multimodal generation. Our code and dataset are publicly available at this https URL.

144. 【2602.02488】RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

链接https://arxiv.org/abs/2602.02488

作者:Yinjie Wang,Tianbao Xie,Ke Shen,Mengdi Wang,Ling Yang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:reinforcement learning framework, dynamically forges environment, amplifying learning signals, closed-loop optimization, framework that dynamically

备注: Code: [this https URL](https://github.com/Gen-Verse/Open-AgentRL)

点击查看摘要

Abstract:We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: this https URL

145. 【2602.02033】One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

链接https://arxiv.org/abs/2602.02033

作者:Shuo Lu,Haohan Wang,Wei Feng,Weizhen Wang,Shen Zhang,Yaoyu Li,Ao Ma,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Bing Zhan,Yuan Xu,Huizai Yao,Yongcan Yu,Chenyang Si,Jian Liang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:existing approaches adopt, Click-Through Rate, neglecting preference diversity, Advertising image generation, strategy that optimizes

备注

点击查看摘要

Abstract:Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a ``one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present \textit{One Size, Many Fits} (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state-of-the-art performance in both offline and online settings. Our code and datasets will be released at this https URL.

146. 【2602.01538】Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

链接https://arxiv.org/abs/2602.01538

作者:Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:talking avatars, grounded human-object, Aware Generation Module, perception, grounded human-object interaction

备注

点击查看摘要

Abstract:Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: this https URL

147. 【2602.03824】Deep-learning-based pan-phenomic data reveals the explosive evolution of avian visual disparity

链接https://arxiv.org/abs/2602.03824

作者:Jiao Sun

类目:Populations and Evolution (q-bio.PE); Computer Vision and Pattern Recognition (cs.CV)

关键词:high-dimensional embedding space, involve subjective biases, avian morphological evolution, natural world, morphology is critical

备注: Readers from the field of computer science may be interested in section 2.1, 2.2, 3.1, 4.1, 4.2. These sections discussed the interpretability and representation learning, especially the texture vs shape problem, highlighting our model's ability of overcoming the texture biases and capturing overall shape features. (Although they're put here to prove the biological validity of the model.)

点击查看摘要

Abstract:The evolution of biological morphology is critical for understanding the diversity of the natural world, yet traditional analyses often involve subjective biases in the selection and coding of morphological traits. This study employs deep learning techniques, utilising a ResNet34 model capable of recognising over 10,000 bird species, to explore avian morphological evolution. We extract weights from the model's final fully connected (fc) layer and investigate the semantic alignment between the high-dimensional embedding space learned by the model and biological phenotypes. The results demonstrate that the high-dimensional embedding space encodes phenotypic convergence. Subsequently, we assess the morphological disparity among various taxa and evaluate the association between morphological disparity and species richness, demonstrating that species richness is the primary driver of morphospace expansion. Moreover, the disparity-through-time analysis reveals a visual "early burst" after the K-Pg extinction. While mainly aimed at evolutionary analysis, this study also provides insights into the interpretability of Deep Neural Networks. We demonstrate that hierarchical semantic structures (biological taxonomy) emerged in the high-dimensional embedding space despite being trained on flat labels. Furthermore, through adversarial examples, we provide evidence that our model in this task can overcome texture bias and learn holistic shape representations (body plans), challenging the prevailing view that CNNs rely primarily on local textures.

Comments:
Readers from the field of computer science may be interested in section 2.1, 2.2, 3.1, 4.1, 4.2. These sections discussed the interpretability and representation learning, especially the texture vs shape problem, highlighting our model’s ability of overcoming the texture biases and capturing overall shape features. (Although they’re put here to prove the biological validity of the model.)

Subjects:

Populations and Evolution (q-bio.PE); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.03824 [q-bio.PE]

(or
arXiv:2602.03824v1 [q-bio.PE] for this version)

https://doi.org/10.48550/arXiv.2602.03824

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
148. 【2602.02798】Real-time topology-aware M-mode OCT segmentation for robotic deep anterior lamellar keratoplasty (DALK) guidance

链接https://arxiv.org/abs/2602.02798

作者:Rosalinda Xiong,Jinglun Yu,Yaning Wang,Ziyi Huang,Jin U. Kang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:approach Descemet membrane, Robotic deep anterior, anterior lamellar keratoplasty, deep anterior lamellar, Descemet membrane

备注

点击查看摘要

Abstract:Robotic deep anterior lamellar keratoplasty (DALK) requires accurate real time depth feedback to approach Descemet's membrane (DM) without perforation. M-mode intraoperative optical coherence tomography (OCT) provides high temporal resolution depth traces, but speckle noise, attenuation, and instrument induced shadowing often result in discontinuous or ambiguous layer interfaces that challenge anatomically consistent segmentation at deployment frame rates. We present a lightweight, topology aware M-mode segmentation pipeline based on UNeXt that incorporates anatomical topology regularization to stabilize boundary continuity and layer ordering under low signal to noise ratio conditions. The proposed system achieves end to end throughput exceeding 80 Hz measured over the complete preprocessing inference overlay pipeline on a single GPU, demonstrating practical real time guidance beyond model only timing. This operating margin provides temporal headroom to reject low quality or dropout frames while maintaining a stable effective depth update rate. Evaluation on a standard rabbit eye M-mode dataset using an established baseline protocol shows improved qualitative boundary stability compared with topology agnostic controls, while preserving deployable real time performance.

149. 【2602.02755】Physics-based generation of multilayer corneal OCT data via Gaussian modeling and MCML for AI-driven diagnostic and surgical guidance applications

链接https://arxiv.org/abs/2602.02755

作者:Jinglun Yu,Yaning Wang,Rosalinda Xiong,Ziyi Huang,Kristina Irsch,Jin U. Kang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:optical coherence tomography, B-scan optical OCT, Training deep learning, deep learning models, corneal optical coherence

备注

点击查看摘要

Abstract:Training deep learning models for corneal optical coherence tomography (OCT) imaging is limited by the availability of large, well-annotated datasets. We present a configurable Monte Carlo simulation framework that generates synthetic corneal B-scan optical OCT images with pixel-level five-layer segmentation labels derived directly from the simulation geometry. A five-layer corneal model with Gaussian surfaces captures curvature and thickness variability in healthy and keratoconic eyes. Each layer is assigned optical properties from the literature and light transport is simulated using Monte Carlo modeling of light transport in multi-layered tissues (MCML), while incorporating system features such as the confocal PSF and sensitivity roll-off. This approach produces over 10,000 high-resolution (1024x1024) image-label pairs and supports customization of geometry, photon count, noise, and system parameters. The resulting dataset enables systematic training, validation, and benchmarking of AI models under controlled, ground-truth conditions, providing a reproducible and scalable resource to support the development of diagnostic and surgical guidance applications in image-guided ophthalmology.

150. 【2602.02713】Perfusion Imaging and Single Material Reconstruction in Polychromatic Photon Counting CT

链接https://arxiv.org/abs/2602.02713

作者:Namhoon Kim,Ashwin Pananjady,Amir Pourmorteza,Sara Fridovich-Keil

类目:Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Single Material reconstruction, single material, Perfusion computed tomography, Toggle, PeRfusion Imaging

备注: Code is available at [this https URL](https://github.com/voilalab/VI-PRISM)

点击查看摘要

Abstract:Background: Perfusion computed tomography (CT) images the dynamics of a contrast agent through the body over time, and is one of the highest X-ray dose scans in medical imaging. Recently, a theoretically justified reconstruction algorithm based on a monotone variational inequality (VI) was proposed for single material polychromatic photon-counting CT, and showed promising early results at low-dose imaging. Purpose: We adapt this reconstruction algorithm for perfusion CT, to reconstruct the concentration map of the contrast agent while the static background tissue is assumed known; we call our method VI-PRISM (VI-based PeRfusion Imaging and Single Material reconstruction). We evaluate its potential for dose-reduced perfusion CT, using a digital phantom with water and iodine of varying concentration. Methods: Simulated iodine concentrations range from 0.05 to 2.5 mg/ml. The simulated X-ray source emits photons up to 100 keV, with average intensity ranging from $10^5$ down to $10^2$ photons per detector element. The number of tomographic projections was varied from 984 down to 8 to characterize the tradeoff in photon allocation between views and intensity. Results: We compare VI-PRISM against filtered back-projection (FBP), and find that VI-PRISM recovers iodine concentration with error below 0.4 mg/ml at all source intensity levels tested. Even with a dose reduction between 10x and 100x compared to FBP, VI-PRISM exhibits reconstruction quality on par with FBP. Conclusion: Across all photon budgets and angular sampling densities tested, VI-PRISM achieved consistently lower RMSE, reduced noise, and higher SNR compared to filtered back-projection. Even in extremely photon-limited and sparsely sampled regimes, VI-PRISM recovered iodine concentrations with errors below 0.4 mg/ml, showing that VI-PRISM can support accurate and dose-efficient perfusion imaging in photon-counting CT.

Comments:
Code is available at this https URL

Subjects:

Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Cite as:
arXiv:2602.02713 [physics.med-ph]

(or
arXiv:2602.02713v1 [physics.med-ph] for this version)

https://doi.org/10.48550/arXiv.2602.02713

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Sara Fridovich-Keil [view email] [v1]
Mon, 2 Feb 2026 19:27:55 UTC (19,289 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Perfusion Imaging and Single Material Reconstruction in Polychromatic Photon Counting CT, by Namhoon Kim and 3 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: physics.med-ph

prev

|
next

new
|
recent
| 2026-02

Change to browse by:

cs
cs.CV
eess
eess.IV
physics

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

151. 【2602.02603】EchoJEPA: A Latent Predictive Foundation Model for Echocardiography

链接https://arxiv.org/abs/2602.02603

作者:Alif Munim,Adibvafa Fallahpour,Teodora Szasz,Ahmadreza Attarpour,River Jiang,Brana Sooriyakanthan,Maala Sooriyakanthan,Heather Whitney,Jeremy Slivnick,Barry Rubin,Wendy Tsang,Bo Wang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:unlabeled video archives, improve diagnostic consistency, learning generalizable representations, large unlabeled video, reduce annotation burden

备注

点击查看摘要

Abstract:Foundation models for echocardiography promise to reduce annotation burden and improve diagnostic consistency by learning generalizable representations from large unlabeled video archives. However, current approaches fail to disentangle anatomical signal from the stochastic speckle and acquisition artifacts that dominate ultrasound imagery. We present EchoJEPA, a foundation model for echocardiography trained on 18 million echocardiograms across 300K patients, the largest pretraining corpus for this modality to date. We also introduce a novel multi-view probing framework with factorized stream embeddings that standardizes evaluation under frozen backbones. Compared to prior methods, EchoJEPA reduces left ventricular ejection fraction estimation error by 19% and achieves 87.4% view classification accuracy. EchoJEPA exhibits strong sample efficiency, reaching 78.6% accuracy with only 1% of labeled data versus 42.1% for the best baseline trained on 100%. Under acoustic perturbations, EchoJEPA degrades by only 2.3% compared to 16.8% for the next best model, and transfers zero-shot to pediatric patients with 15% lower error than the next best model, outperforming all fine-tuned baselines. These results establish latent prediction as a superior paradigm for ultrasound foundation models.

152. 【2602.02552】Super-résolution non supervisée d'images hyperspectrales de télédétection utilisant un entraînement entièrement synthétique

链接https://arxiv.org/abs/2602.02552

作者:Xinxin Xu,Yann Gousseau,Christophe Kervazo,Saïd Ladjal

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:enhance spatial resolution, rich spectral information, aims to enhance, Hyperspectral single image, enhance spatial

备注: in French language

点击查看摘要

Abstract:Hyperspectral single image super-resolution (SISR) aims to enhance spatial resolution while preserving the rich spectral information of hyperspectral images. Most existing methods rely on supervised learning with high-resolution ground truth data, which is often unavailable in practice. To overcome this limitation, we propose an unsupervised learning approach based on synthetic abundance data. The hyperspectral image is first decomposed into endmembers and abundance maps through hyperspectral unmixing. A neural network is then trained to super-resolve these maps using data generated with the dead leaves model, which replicates the statistical properties of real abundances. The final super-resolution hyperspectral image is reconstructed by recombining the super-resolved abundance maps with the endmembers. Experimental results demonstrate the effectiveness of our method and the relevance of synthetic data for training.