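This post lists the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.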

Statistics

838 papers were updated today, including:

  • Natural language processing: 103
  • Information retrieval: 31
  • Computer vision: 168

Natural Language Processing

1. 【2605.20179】TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Link: https://arxiv.org/abs/2605.20179

Authors: Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang

Categories: Computation and Language (cs.CL)

Keywords: Diffusion Large Language, Large Language Models, Large Language, parallel block-level decoding, Diffusion Large

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system. Specifically, we leverage the temporal stability of expert activations during the diffusion process within the block and introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.

2. 【2605.20177】From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Link: https://arxiv.org/abs/2605.20177

Authors: Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, emphasize long, visual perception, reasoning, advances in vision-language

Comments: 19 pages, 9 figures; Accepted to ICML 2026; Project Page: [this https URL](https://ucsc-vlaa.github.io/VLM-CapCurriculum/)

Abstract:Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.

3. 【2605.20176】ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Link: https://arxiv.org/abs/2605.20176

Authors: Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen, Zijun Wang, Cihang Xie, Yuyin Zhou

Categories: Computation and Language (cs.CL)

Keywords: Large language models, works largely assume, Large language, Claude Opus, improves Claude Opus

Comments: 24 pages, 9 figures; Project Page: [this https URL](https://ucsc-vlaa.github.io/ClinSeekAgent/)

Abstract:Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

4. 【2605.20170】KoRe: Compact Knowledge Representations for Large Language Models

Link: https://arxiv.org/abs/2605.20170

Authors: Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Modern Large Language, Large Language, Language Models, shown impressive performances

Abstract:Modern Large Language Models (LLMs) have shown impressive performances in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world-knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into a LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performances coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.

5. 【2605.20158】Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Link: https://arxiv.org/abs/2605.20158

Authors: Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, faithfully ground responses

Abstract:Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at this https URL.

6. 【2605.20149】Less Back-and-Forth: A Comparative Study of Structured Prompting

Link: https://arxiv.org/abs/2605.20149

Authors: Saurav Ghosh, Gabriella Polach, Abdou Sow

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large language models, Large language, language models, lead to low-quality, low-quality answers

Comments: 7 pages, 2 figures, 6 tables

Abstract:Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types (summarization, planning, explanation, and coding) using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

7. 【2605.20128】MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

Link: https://arxiv.org/abs/2605.20128

Authors: Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li, Yanru Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, high-stakes decision-making, increasingly integrated, integrated into high-stakes

Comments: 12 pages, 6 figures, 4 tables

Abstract:Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

8. 【2605.20087】ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Link: https://arxiv.org/abs/2605.20087

Authors: Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: existing datasets capture, reached billions, Conversational, existing datasets, thoughts

Comments: 53 pages, 23 figures, 4 tables. Project website: [this https URL](https://thoughttrace-project.github.io/)

Abstract:Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human-AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.

9. 【2605.20084】BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.20084

Authors: Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, language models, retrieval-augmented generation, answer is reliable

Abstract:Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
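The cascade described in this abstract (answer with the LLM alone, escalate to RAG if uncertain, abstain if neither branch is trusted) can be sketched as follows. This is only an illustration of the routing rule and of how one threshold pair might be scored on labeled data; it does not implement the paper's sequential graphical testing. The names `route`, `evaluate`, and the sample layout are assumptions, as is the convention that a lower score means less uncertainty.

```python
from typing import List, Tuple

# (u_llm, u_rag, llm_correct, rag_correct); field layout is illustrative.
Sample = Tuple[float, float, bool, bool]


def route(u_llm: float, u_rag: float, t_llm: float, t_rag: float) -> str:
    """Pick the branch that answers a query from its two uncertainty scores."""
    if u_llm <= t_llm:      # the LLM-only answer is trusted
        return "llm"
    if u_rag <= t_rag:      # escalated, and the RAG answer is trusted
        return "rag"
    return "abstain"        # neither branch is trusted


def evaluate(samples: List[Sample], t_llm: float, t_rag: float) -> Tuple[float, float, float]:
    """Return (error rate among accepted, coverage, retrieval rate) for one threshold pair."""
    n = len(samples)
    if n == 0:
        return 0.0, 0.0, 0.0
    accepted = errors = escalated = 0
    for u_llm, u_rag, llm_ok, rag_ok in samples:
        if u_llm > t_llm:
            escalated += 1  # retrieval runs for every escalated query
        branch = route(u_llm, u_rag, t_llm, t_rag)
        if branch == "abstain":
            continue
        accepted += 1
        correct = llm_ok if branch == "llm" else rag_ok
        errors += 0 if correct else 1
    risk = errors / accepted if accepted else 0.0
    return risk, accepted / n, escalated / n


samples = [
    (0.1, 0.5, True, True),    # confident and correct LLM answer
    (0.8, 0.2, False, True),   # escalated, RAG answer accepted
    (0.9, 0.9, False, False),  # escalated, then abstained
    (0.3, 0.1, False, True),   # LLM answer accepted but wrong
]
risk, coverage, retrieval_rate = evaluate(samples, t_llm=0.4, t_rag=0.3)
print(risk, coverage, retrieval_rate)
```

Sweeping `t_llm` and `t_rag` over a grid and calling `evaluate` on each pair gives the two-dimensional lattice of operating points that the abstract refers to; deciding which of those points are statistically safe is the part this sketch leaves out.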

10. 【2605.20075】CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Link: https://arxiv.org/abs/2605.20075

Authors: Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, eliciting reasoning capabilities, standard approach, approach for eliciting, capabilities from large

Comments: Code: [this https URL](https://github.com/sdc17/CopT), Website: [this https URL](https://copt-web.github.io/)

Abstract:Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at this https URL.

11. 【2605.20066】Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Link: https://arxiv.org/abs/2605.20066

Authors: Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

Categories: Computation and Language (cs.CL)

Keywords: Knowledge graph question, graph question answering, Knowledge graph, knowledge graphs, question answering seeks

Comments: Accepted by NeSy 2026

Abstract:Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
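For readers unfamiliar with GRPO, the group-relative step it is named after can be sketched in a few lines: several candidate outputs are sampled for the same prompt, each receives a scalar reward, and each reward is standardized against its own group. The snippet below is a generic illustration, not this paper's training code; the function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize the rewards of one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards for four sampled queries: 1.0 if the query
# executed and returned the right answer, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When all samples in a group receive the same reward, every advantage is zero and that group contributes no learning signal, which is one reason the design of the reward terms matters for outcome-based training.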

12. 【2605.20061】Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Link: https://arxiv.org/abs/2605.20061

Authors: Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

Categories: Computation and Language (cs.CL)

Keywords: improving large language, large language model, paradigm for improving, improving large, large language

Comments: 10 pages, 4 figures, 3 tables, plus appendix

Abstract:Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to 20.4 percentage points over the episode-level baseline GRPO and increases sample efficiency by 2.1×. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: this https URL.

13. 【2605.20052】PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

Link: https://arxiv.org/abs/2605.20052

Authors: Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: medical imaging research, enables large-scale annotation, Automatic report labeling, Automatic report, imaging research

Comments: BioNLP 2026 @ ACL

Abstract:Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at this https URL.

14. 【2605.20050】Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

Link: https://arxiv.org/abs/2605.20050

Authors: Calvin Yixiang Cheng, Dorian Quelle, Scott A. Hale

Categories: Computation and Language (cs.CL)

Keywords: study investigates, affect the persistent, persistent diffusion, language mutations affect, social media

Abstract:This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, and risk- and health-related vocabularies, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns, simplification and assimilation, at both linguistic and AAT structural levels. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed light on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims that can address their potential variations.

15. 【2605.20043】Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

Link: https://arxiv.org/abs/2605.20043

Authors: Wen Zhang

Categories: Computation and Language (cs.CL)

Keywords: representational system encoding, system encoding morphophonological, encoding morphophonological distinctions, analysis of Japanese, Japanese past-tense morphological

Abstract:We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.

16. 【2605.20022】FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Link: https://arxiv.org/abs/2605.20022

Authors: Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang

Categories: Computation and Language (cs.CL)

Keywords: memory-bound LLM inference, accelerates memory-bound LLM, memory-bound LLM, LLM inference, Speculative decoding

Abstract:Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
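The draft-then-verify loop that this abstract builds on can be illustrated with a toy greedy variant. The sketch below is the conventional sequential scheme, not FlexDraft itself: a drafter proposes `k` tokens, the target checks them, the longest agreeing prefix is kept, and the target supplies one extra token (a correction at the first mismatch, or a bonus token if every draft is accepted). Both toy models and all function names are assumptions made for the example.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1) The drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target verifies each position. A real system scores all
    #    positions in one batched forward pass; this loop is for clarity.
    committed: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            committed.append(expected)  # correction replaces the rejected draft
            return committed
        committed.append(tok)
        ctx.append(tok)

    committed.append(target(ctx))  # bonus token: every draft was accepted
    return committed


def toy_target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=4))  # prints [2, 3, 4]
```

Because only tokens the target itself would have produced are committed, the output matches plain greedy decoding with the target, which is the sense in which such schemes are called lossless. The number of committed tokens varies from round to round, which relates to the accepted-length uncertainty the abstract mentions.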

17. 【2605.19952】Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Link: https://arxiv.org/abs/2605.19952

Authors: Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

Categories: Computation and Language (cs.CL)

Keywords: reliable long-term interaction, enable reliable long-term, LLM agents require, accumulated dialogue history, efficiently retrieve

Abstract:To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact-based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, and synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updating. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at this https URL.

18. 【2605.19945】GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

Link: https://arxiv.org/abs/2605.19945

Authors: Sourish Wawdhane, Avinash Kumar, Poulami Das

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enable efficient inference, employing smaller experts, models enable efficient, GPUs, experts

Comments: 18 pages

Click to view abstract

Abstract:Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and route tokens to appropriate GPUs at inference time based on experts activated. They process tokens in lock-step fashion, where tokens within a batch must finish processing before proceeding to the next layer. This synchronization barrier acts as a critical bottleneck because the performance of MoE models is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used experts are placed on the same GPU or the slowest GPU. While prior works place experts that balance token loads across GPUs, they all overlook GPU variability and often place highly used experts on the slowest GPUs. We propose GEM, GPU-variability-aware Expert Mapping, a framework for GPU variability-aware expert to GPU mapping for MoE models. GEM exploits two insights. First, we must place experts such that each GPU receives non-uniform token loads based on their variability and they all finish processing a layer at about the same time. Our studies show that there are two types of experts: consistent that are used most of the time and temporal that are often used together for the remaining time. Our second insight is that we must place simultaneously used consistent and temporal experts on different GPUs and avoid placing them on slower GPUs to reduce slowdown. GEM gathers the variability profile of GPUs for each model and task and uses the token load distributions per task to map experts to GPUs. Our experiments show that GEM improves end-to-end latency by 7.9% on average and by up to 16.5% compared to the baseline.
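The first insight, giving slower GPUs proportionally less load so that all GPUs finish a layer at about the same time, can be illustrated with a small greedy heuristic. This sketch is an assumption for illustration and is not GEM's algorithm: `map_experts`, the relative `gpu_speed` values, and the per-expert loads are all hypothetical, and it ignores the consistent/temporal co-activation constraint described in the abstract.

```python
from typing import Dict, List


def map_experts(expert_load: Dict[str, float], gpu_speed: List[float]) -> List[List[str]]:
    """Greedy speed-aware placement: heaviest experts first, each placed on the
    GPU whose projected finish time (assigned load / speed) would be smallest."""
    placement: List[List[str]] = [[] for _ in gpu_speed]
    assigned = [0.0] * len(gpu_speed)

    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        best = min(
            range(len(gpu_speed)),
            key=lambda g: (assigned[g] + load) / gpu_speed[g],
        )
        placement[best].append(expert)
        assigned[best] += load
    return placement


def finish_times(
    expert_load: Dict[str, float], gpu_speed: List[float], placement: List[List[str]]
) -> List[float]:
    """Per-GPU time to process its assigned token load at its relative speed."""
    return [
        sum(expert_load[e] for e in experts) / gpu_speed[g]
        for g, experts in enumerate(placement)
    ]


# Hypothetical token loads per expert and relative GPU speeds (GPU 1 is slower).
loads = {"e0": 40.0, "e1": 30.0, "e2": 20.0, "e3": 10.0}
speeds = [1.0, 0.5]

placement = map_experts(loads, speeds)
times = finish_times(loads, speeds, placement)
print(placement, times)
```

The layer latency is `max(times)`, the straggler. A placement that splits token load evenly would leave the slower GPU finishing last, whereas the speed-aware placement deliberately gives it a smaller share.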

19. 【2605.19944】A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

Link: https://arxiv.org/abs/2605.19944

Authors: Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)

Keywords: laws for LLM, generalization remain elusive, theoretical mechanisms governing, LLM reasoning, empirical scaling laws

Comments: Preprint

Click to view abstract

Abstract:While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.
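The duality step mentioned in this abstract can be written out explicitly. What follows is the standard Kantorovich-Rubinstein statement, not the paper's own theorem: $P$ and $Q$ stand for the training and test distributions and $\ell$ for an $L$-Lipschitz loss, all of which are symbols assumed here for illustration.

$$
W_1(P, Q) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right),
\qquad
\left| \mathbb{E}_{x \sim Q}[\ell(x)] - \mathbb{E}_{x \sim P}[\ell(x)] \right| \le L \, W_1(P, Q).
$$

The second inequality follows by applying the first to $f = \ell / L$. It shows why a bounded Lipschitz constant keeps the risk gap proportional to the domain shift, and why an architecture without such a bound gets no guarantee of this form.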

20. 【2605.19936】What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience

Link: https://arxiv.org/abs/2605.19936

Authors: Filip Miletić, Neele Falk

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, scientific communication changed, communication changed due, large language models, style of scientific

Comments: Accepted to LREC 2026

Click to view abstract

Abstract:Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024); and a synthetic dataset of 3,000 human-written passages and their LLM-generated improvements. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM-modified texts more frequently contain certain syntactic constructions, more complex and longer words and a lower lexical diversity. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts. They overall rate LLM-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI-assisted writing on reading experience.

21. 【2605.19932】PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Link: https://arxiv.org/abs/2605.19932

Authors: Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language model, Large language, agents increasingly operate, language model, code repositories

Comments

Click to view abstract

Abstract:Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
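A priority-based evictor that enforces a fixed token budget, as described for the context map, could look roughly like this. It is a minimal sketch under stated assumptions: `MapEntry`, `evict_to_budget`, the priority scores, and the token counts are hypothetical and do not reflect PEEK's actual data structures or scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MapEntry:
    text: str
    priority: float  # higher means more useful; the scoring rule is an assumption
    tokens: int      # token count, assumed to be supplied by a tokenizer


def evict_to_budget(entries: List[MapEntry], budget: int) -> List[MapEntry]:
    """Keep the highest-priority entries whose total token count fits the budget."""
    kept: List[MapEntry] = []
    used = 0
    for entry in sorted(entries, key=lambda e: e.priority, reverse=True):
        if used + entry.tokens <= budget:
            kept.append(entry)
            used += entry.tokens
    return kept


# Hypothetical context-map entries.
context_map = [
    MapEntry("schema: orders(id, total)", priority=0.9, tokens=40),
    MapEntry("constant: TAX_RATE", priority=0.4, tokens=30),
    MapEntry("layout: src/ holds services, tests/ mirrors it", priority=0.7, tokens=50),
]

kept = evict_to_budget(context_map, budget=100)
print([entry.text for entry in kept])
```

Because the budget is fixed, the map stays constant-sized in the prompt no matter how many invocations have contributed entries.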

22. 【2605.19908】Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Link: https://arxiv.org/abs/2605.19908

Authors: Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

Categories: Computation and Language (cs.CL)

Keywords: attribution models fine-tuned, scoring mechanism, Authorship attribution models, loss can differ, differ four-fold

Comments: 12 pages, 6 figures. Under review

Click to view abstract

Abstract:Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

23. 【2605.19852】Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Link: https://arxiv.org/abs/2605.19852

Authors: Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi

Categories: Computation and Language (cs.CL)

Keywords: multimodal large language, large language models, promising direction, direction for enhancing, capabilities of multimodal

Comments: Accepted to ICML 2026

Click to view abstract

Abstract:Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on the V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on the POPE benchmark. Code is available at this https URL.

24. 【2605.19848】CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

Link: https://arxiv.org/abs/2605.19848

Authors: Yike Sun, Mingkun Xu, Mu You, Zhongzhi He, Henghua Shen, Zehan Tan, Derek F. Wong, Tao Fang

Categories: Computation and Language (cs.CL)

Keywords: deep learning models, recent years, diagnosis and finance, black-box nature, nature of deep

Comments

Click to view abstract

Abstract:In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.

25. 【2605.19846】FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Link: https://arxiv.org/abs/2605.19846

Authors: Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: demonstrated remarkable capabilities, real-world applications requiring, applications requiring nuanced, requiring nuanced interpretation, fine-grained comprehension crucial

Comments: CVPR'26 (Workshop on Video Large Language Models)

Click to view abstract

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.

26. 【2605.19837】CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

Link: https://arxiv.org/abs/2605.19837

Authors: Sherif Khairy, Catherine M. Elias

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

Keywords: degrades camera-based object, Adverse weather, degrades camera-based, autonomous vehicles, camera-based object detection

Comments

Click to view abstract

Abstract:Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.

27. 【2605.19833】Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Link: https://arxiv.org/abs/2605.19833

Authors: Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Keywords: automatic speech recognition, large audio-language models, real-world environments remains, environments remains limited, acoustic robustness bottleneck

Comments: Project page: [this https URL](https://xzf-thu.github.io/Mega-ASR/). Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail

Click to view abstract

Abstract:Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.

28. 【2605.19824】From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Link: https://arxiv.org/abs/2605.19824

Authors: Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Large Language Models, Large Multimodal Models, Language Models, Multimodal Models, Autonomous Vehicles

Comments

Click to view abstract

Abstract:Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

29. 【2605.19815】LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

Link: https://arxiv.org/abs/2605.19815

Authors: Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Legal NLP, doctrinal scholarship, reasoning and doctrinal, Legal, Court of Justice

Comments

Click to view abstract

Abstract:Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

30. 【2605.19806】Chunking German Legal Code

Link: https://arxiv.org/abs/2605.19806

Authors: Max Prior, Natalia Milanova, Andreas Schultz

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: German Civil Code, German statutory law, structured benchmark corpus, German Civil, Civil Code

Comments

Click to view abstract

Abstract:This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure, particularly section- and subsection-based retrieval, achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.
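Two pieces of this evaluation setup are easy to make concrete: a fixed-size window chunker (one of the compared baselines) and recall against section-level gold labels. The sketch below is illustrative only; the function names, the overlap parameter, and the section identifiers in the example are assumptions, not the paper's code or data.

```python
from typing import List, Set


def fixed_size_chunks(words: List[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Sliding fixed-size windows over a word list, ignoring legal structure."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [words[i:i + size] for i in range(0, last_start, step)]


def recall_at_k(retrieved_sections: List[str], gold_sections: Set[str], k: int) -> float:
    """Fraction of gold sections covered by the sections of the top-k retrieved chunks."""
    if not gold_sections:
        return 0.0
    hits = gold_sections & set(retrieved_sections[:k])
    return len(hits) / len(gold_sections)


chunks = fixed_size_chunks(list("abcdefghij"), size=4, overlap=1)

# Each retrieved chunk is mapped back to the section it came from (hypothetical IDs).
ranked = ["§ 433", "§ 434", "§ 280"]
gold = {"§ 433", "§ 437"}
score = recall_at_k(ranked, gold, k=2)
print(chunks, score)
```

Scoring at section level means a chunker is rewarded whenever any of its chunks from the right section is retrieved, which makes structure-aligned and fixed-size chunkers directly comparable.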

31. 【2605.19798】Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

Link: https://arxiv.org/abs/2605.19798

Authors: Lucie Galland, Chloé Clavel, Magalie Ochs

Categories: Computation and Language (cs.CL)

Keywords: Socially Interactive Agents, Socially Interactive, agent actual capabilities, Interactive Agents, Large Language Models

Comments

Click to view abstract

Abstract:As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors, in line with the intended instructions.

32. 【2605.19766】Synthesis and Evaluation of Long-term History-aware Medical Dialogue

Link: https://arxiv.org/abs/2605.19766

Authors: Hebin Hu, Renke Dai, Ah-Hwee Tan, Yilin Kang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: recall and reason, effective healthcare agent, reasoning, effective healthcare, Cross-dialogue Reasoning

Comments: Accepted by AAMAS 2026

Click to view abstract

Abstract:An effective healthcare agent must be able to recall and reason over a patient's longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define automatic measures (Faithfulness, Coherence, and Diversity) together with two LLM-based evaluations: Correctness and Realism. Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark's applicability and underscore the need for tailored methods to advance healthcare agents.

33. 【2605.19762】What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

Link: https://arxiv.org/abs/2605.19762

Authors: Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: foundation language model, modern foundation language, programming remains unclear, language model, remains unclear

Comments: Accepted by ICML 2026, 22 pages, 10 figures

Click to view abstract

Abstract:Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

34. 【2605.19738】TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

Link: https://arxiv.org/abs/2605.19738

Authors: Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Graph Anomaly Detection, atypical graph entities, identify atypical graph, Anomaly Detection, aims to identify

Comments: 14 pages, 5 figures

Click to view abstract

Abstract:Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at this https URL.

35. 【2605.19735】ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.19735

Authors: Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Graph-structured retrieval-augmented generation, large language models, improve answer quality, Graph-structured retrieval-augmented, current systems rely

Comments: Preprint. 6 tables

Click to view abstract

Abstract:Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.
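The extrapolated HiRAG figure in this abstract is a proportional scaling of the 20-task measurement, under the abstract's own assumption that indexing cost grows linearly with the number of tasks:

$$
3.54\,\text{M tokens} \times \frac{130}{20} = 3.54\,\text{M} \times 6.5 = 23.01\,\text{M tokens}
$$

This is consistent with the stated "over 23M indexing tokens", compared with the 22,073 tokens reported for ContextRAG on the full 130-task subset.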

36. 【2605.19723】Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Link: https://arxiv.org/abs/2605.19723

Authors: Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluating artificial intelligence, Large Language Models, artificial intelligence systems, problem-solving in education, Mathematical reasoning

Abstract:Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

37. 【2605.19718】CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

Link: https://arxiv.org/abs/2605.19718

Authors: Francesca Padovani, Xiulin Yang, Bastian Bunzeck, Jaap Jumelet, Yevgen Matusevych, Nathan Schneider, Arianna Bisazza

Categories: Computation and Language (cs.CL)

Keywords: structure remain limited, syntactic structure remain, gold-standard Universal Dependencies, remain limited, language acquisition studies

Abstract:CHILDES is a paramount resource for language acquisition studies, yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child-adult interactions, outperforming widely used off-the-shelf English parsers, including spaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Parsing Toolkit for Child-Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.

38. 【2605.19714】LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

Link: https://arxiv.org/abs/2605.19714

Authors: Mona H. Albaqawi, Eman M. Albalkhi, Joud A. Albaiti, Enrico Lopedoto

Categories: Computation and Language (cs.CL)

Keywords: contexts remains challenging, remains challenging due, Arabic financial contexts, financial contexts remains, Investor sentiment shapes

Comments: Accepted at the 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7), co-located with LREC 2026, Palma de Mallorca, Spain, May 2026. ISBN: 978-2-493814-52-4

Abstract:Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

39. 【2605.19711】Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Link: https://arxiv.org/abs/2605.19711

Authors: Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling

Categories: Computation and Language (cs.CL)

Keywords: Automatic speech recognition, Automatic speech, speech recognition, recent years, improved substantially

Comments: Submitted to Interspeech 2026

Abstract:Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
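
To make the WER comparison concrete, here is a minimal sketch of word error rate and of an N-best oracle WER. It assumes that "oracle" means the best hypothesis in the recognizer's N-best list, which the abstract does not define. The function names and the English example sentences are illustrative and are not taken from the paper's Frisian data.

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def oracle_wer(reference: str, nbest: list[str]) -> float:
    """Lowest WER achievable by picking one hypothesis from the N-best list."""
    return min(wer(reference, h) for h in nbest)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    nbest = ["the cat sat on mat", "a cat sat in the mat"]
    corrected = "the cat sat on the mat"  # hypothetical LLM-corrected output
    print(oracle_wer(reference, nbest), wer(reference, corrected))
```

Under this reading, a generative corrector can beat the oracle because it may produce words that appear in no single hypothesis, whereas the oracle can only select among the existing ones.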

40. 【2605.19660】OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Link: https://arxiv.org/abs/2605.19660

Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: efficient deployment, rapid advancement, advancement toward long-context, long-context reasoning, intelligence has made

Comments: Under review

Abstract:The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at this https URL.
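
The Token Norm Imbalance effect can be illustrated with a toy example. The sketch below is not OScaR and does not implement Canalized Rotation or Omni-Token Scaling. It only assumes a generic uniform 2-bit quantizer whose scale and zero-point are shared by all tokens of one channel, with made-up values and hypothetical function names.

```python
def quantize_shared(values: list[float], bits: int = 2) -> list[float]:
    """Quantize and dequantize a group with one shared (scale, zero-point)."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_abs_error(original: list[float], restored: list[float]) -> float:
    """Largest absolute reconstruction error over the given entries."""
    return max(abs(a - b) for a, b in zip(original, restored))


# One key channel observed across four tokens (illustrative numbers).
balanced = [-1.0, -0.3, 0.4, 1.0]
imbalanced = [-1.0, -0.3, 0.4, 10.0]  # the last token has a far larger magnitude

# Compare the error on the first three tokens, which are identical in both groups.
err_balanced = max_abs_error(balanced[:3], quantize_shared(balanced)[:3])
err_imbalanced = max_abs_error(imbalanced[:3], quantize_shared(imbalanced)[:3])
print(err_balanced, err_imbalanced)
```

The three small tokens are unchanged between the two groups, yet their reconstruction error grows sharply once a single large token stretches the shared quantization range. That is the kind of disparity the abstract attributes to TNI.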

41. 【2605.19645】K-Quantization and its Impact on Output Performance

Link: https://arxiv.org/abs/2605.19645

Authors: Robin Baki Davidsson, Pierre Nugues

Categories: Computation and Language (cs.CL)

Keywords: Recent advancements, NLP tasks, large language models, advancements in large, large language

Comments: 13 pages, 4 figures

Abstract:Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between performance and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.

42. 【2605.19633】optimize_anything: A Universal API for Optimizing any Text Parameter

Link: https://arxiv.org/abs/2605.19633

Authors: Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE)

Keywords: match specialized tools, Gemini Flash ARC-AGI, specialized tools, tools across fundamentally, system match specialized

Comments: 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

Abstract:Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at this https URL.

43. 【2605.19597】LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Link: https://arxiv.org/abs/2605.19597

Authors: Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

Categories: Computation and Language (cs.CL)

Keywords: rule-governed tasks require, tasks require conclusions, Evaluating large language, large language models, stated premises

Abstract:Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at this https URL.

44. 【2605.19577】GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Link: https://arxiv.org/abs/2605.19577

Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, jian Liang, Ruiming Tang, Han Li

Categories: Computation and Language (cs.CL)

Keywords: capability-oriented post-training recipe, long-context reinforcement learning, present GoLongRL, post-training recipe, reinforcement learning

Abstract:We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

45. 【2605.19576】Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

Link: https://arxiv.org/abs/2605.19576

Authors: Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: unbounded skill accumulation, silent failure mode, outcome-driven lifecycle management, skill libraries face, Self-evolving skill libraries

Abstract:Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver a +0.0pp gain while human-curated ones deliver +16.2pp on SkillsBench), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, of which one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

46. 【2605.19575】A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics

Link: https://arxiv.org/abs/2605.19575

Authors: Elena Mikhalkova, Anastasiya Vishnyakova, Anastasiya Drozdova, Polina Gavin, Aleksander Zhmykhov, Timofey Protasov

Categories: Computation and Language (cs.CL)

Keywords: article observes data, observes data analysis, notion of idiomaticity, article observes, observes data

Abstract:The article presents a data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical, and other criteria described in theoretical books and papers on the notion of idiomaticity. The MWEs were collected from the same theoretical sources, and a group of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; and the presence of obsolete words and grammar influences the ability of an MWE to be replaced with one word.

47. 【2605.19568】m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

Link: https://arxiv.org/abs/2605.19568

Authors: Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

Categories: Computation and Language (cs.CL)

Keywords: search and advertising, Matryoshka Bidirectional Encoder, pretraining, industrial information retrieval, information retrieval systems

Comments: KDD 2026

Abstract:Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining pipeline: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models on Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
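
As a rough illustration of how a Matryoshka-style embedding is typically consumed, the sketch below keeps only the leading dimensions of a vector and re-normalizes them. Prefix truncation is the usual convention in Matryoshka representation learning and is assumed here; the abstract does not state how m3BERT serves its smaller embeddings. The function names and vectors are hypothetical.

```python
import math


def truncate_and_normalize(embedding: list[float], dim: int) -> list[float]:
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix
    return [x / norm for x in prefix]


def cosine_of_unit_vectors(a: list[float], b: list[float]) -> float:
    """Dot product, which equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    query = [3.0, 4.0, 1.0, 2.0]       # made-up full-size embeddings
    document = [3.0, 4.0, -2.0, 0.5]
    for dim in (2, 4):
        q = truncate_and_normalize(query, dim)
        d = truncate_and_normalize(document, dim)
        print(dim, cosine_of_unit_vectors(q, d))
```

A deployment with a tight latency or memory budget would index the short prefix, while a higher-accuracy tier could use the full vector from the same model.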

48. 【2605.19523】Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

Link: https://arxiv.org/abs/2605.19523

Authors: Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: general multi-modal understanding, demonstrated remarkable proficiency, efficiently acquire continually, acquire continually evolving, continually evolving domain-specific

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.
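
The two merging methods named above can be sketched on flattened weight lists. This assumes that TA stands for task arithmetic and uses the commonly described form of DARE (randomly drop delta entries, then rescale the survivors). How the paper aligns LLM and VLM parameters is not stated in the abstract, so the variable names, the numbers, and the way the delta is applied to the VLM are illustrative assumptions.

```python
import random


def task_vector(finetuned: list[float], base: list[float]) -> list[float]:
    """Delta between a domain-expert model and the base model it started from."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic(target: list[float], delta: list[float], lam: float) -> list[float]:
    """Add a scaled task vector to the target model's weights."""
    return [t + lam * d for t, d in zip(target, delta)]


def dare(delta: list[float], drop_prob: float, rng: random.Random) -> list[float]:
    """Drop each delta entry with probability drop_prob, rescale the rest by 1/(1-p)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else d * keep_scale for d in delta]


if __name__ == "__main__":
    base_llm = [0.0, 1.0, 2.0, 3.0]      # made-up weights
    expert_llm = [0.5, 0.0, 2.0, 4.0]
    vlm_language_weights = [0.1, 1.1, 1.9, 3.2]

    delta = task_vector(expert_llm, base_llm)
    sparse_delta = dare(delta, drop_prob=0.5, rng=random.Random(0))
    print(task_arithmetic(vlm_language_weights, delta, lam=0.5))
    print(task_arithmetic(vlm_language_weights, sparse_delta, lam=0.5))
```

Both the scaling coefficient `lam` and the drop probability are the kind of hyperparameters the abstract says these methods depend on.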

49. 【2605.19516】Base Models Look Human To AI Detectors

Link: https://arxiv.org/abs/2605.19516

Authors: Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: AI-generated text enters, real-world at scale, institutions increasingly, academic-integrity workflows, enters the real-world

Comments: 39 pages, 9 figures

Abstract:As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, generated text from base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

50. 【2605.19514】Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management

Link: https://arxiv.org/abs/2605.19514

Authors: Guanyu Cui, Zhewei Wei, Kun He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: works make, make the eye-catching, eye-catching claim, fixed Transformer system, fixed autoregressive Transformer

Comments: Accepted to the ICML 2026 Position Paper Track

Abstract:Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.

51. 【2605.19470】Drifting Objectives for Refining Discrete Diffusion Language Models

Link: https://arxiv.org/abs/2605.19470

Authors: Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: diffusion language models, iteratively denoising categorical, categorical token sequences, Discrete diffusion language, recent drifting methods

Comments: Project page: https://daioba.github.io/tokendrift/

Abstract:Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.

52. 【2605.19436】CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Link: https://arxiv.org/abs/2605.19436

Authors: Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: correct answer favor, correct answer, verifiable rewards, solution under reinforcement, reinforcement learning

Comments: 9 pages

Abstract:When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at this https URL.

53. 【2605.19433】Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

Link: https://arxiv.org/abs/2605.19433

Authors: Bing Wang, Shaotian Yan, Chen Shen, kaiyuan liu, Sinan Fan, Ximing Li, Rui Miao, Xiaosong Yuan, Zhanming Shen, Jieping Ye

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, hinders real-world deployment, achieved remarkable success, immense computational overhead, computational overhead hinders

Comments: 26 pages, 8 figures

Abstract:Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student's on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.

54. 【2605.19416】LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Link: https://arxiv.org/abs/2605.19416

Authors: Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao

Categories: Computation and Language (cs.CL)

Keywords: Group Relative Policy, Relative Policy Optimization, modern reinforcement learning, leveraging reward normalization, reinforcement learning alignment

Abstract:Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce Lambda Policy Optimization (LambdaPO), a novel framework that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.
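The pairwise advantage described in this abstract can be illustrated with a rough sketch. The abstract does not give the exact weighting, so the sigmoid-based attenuation below is an assumption borrowed from LambdaRank-style pairwise weights, and the names `pairwise_advantages`, `logps`, and `beta` are hypothetical. The semantic density reward is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_advantages(rewards, logps, beta=1.0):
    """Toy pairwise advantage: sum of reward gaps against all peers.

    Each gap is down-weighted by how confident the policy already is in the
    preferred trajectory (ASSUMED form: 1 - sigmoid(beta * logp margin)).
    """
    n = len(rewards)
    advantages = [0.0] * n
    for i in range(n):
        for j in range(n):
            gap = rewards[i] - rewards[j]
            if i == j or gap == 0:
                continue  # ties carry no preference signal
            winner, loser = (i, j) if gap > 0 else (j, i)
            confidence = sigmoid(beta * (logps[winner] - logps[loser]))
            advantages[i] += gap * (1.0 - confidence)
    return advantages


# One correct rollout among three, with the policy undecided between them.
print(pairwise_advantages([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

Because the weight is shared by both members of a pair, the advantages in this sketch sum to zero across the group, and they shrink toward zero once the policy already ranks the pair correctly with high confidence.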

55. 【2605.19394】EmbGen: Teaching with Reassembled Corpora

Link: https://arxiv.org/abs/2605.19394

Authors: Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Adapting small instruction-tuned, small instruction-tuned models, Adapting small, supervised fine-tuning, collect at scale

Comments: 8 pages, 4 images (32 pages with appendix)

Abstract:Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation, we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composite metric combining Factual Accuracy and Completeness. Relative to the strongest baseline, EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and by 88.9% at the 20M-token budget, while remaining competitive on the other, less heterogeneous datasets.

56. 【2605.19358】Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

Link: https://arxiv.org/abs/2605.19358

Authors: Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu, Jiaen Liang, Ying Fu, Wei Huang, Jitao Sang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Entropy-based deep reasoning, Language Models, Large Language, Entropy-based deep

Abstract:Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
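The conditional bidirectional policy in this abstract can be sketched in a few lines. This is only an illustration of the sign convention: the `threshold` and `alpha` values and the function names are hypothetical, and the actual CES objective built on DAPO is not specified in the abstract.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaping_terms(token_dists, is_correct, threshold=1.0, alpha=0.1):
    """Per-token shaping term under an ASSUMED linear form.

    High-entropy ("forking point") tokens are penalized on correct paths
    and rewarded on incorrect ones; low-entropy tokens are left untouched.
    """
    sign = -1.0 if is_correct else 1.0
    terms = []
    for dist in token_dists:
        entropy = token_entropy(dist)
        terms.append(sign * alpha * entropy if entropy > threshold else 0.0)
    return terms


# A confident token followed by a four-way uniform "forking" token.
dists = [[1.0], [0.25, 0.25, 0.25, 0.25]]
print(shaping_terms(dists, is_correct=True))
print(shaping_terms(dists, is_correct=False))
```

The two calls produce terms of equal magnitude and opposite sign for the forking token, which is the bidirectional behavior the abstract describes.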

57. 【2605.19357】SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Link: https://arxiv.org/abs/2605.19357

Authors: Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, language models, required in practice, increasingly applied

Comments: Accepted to ACL 2026 Main Conference

Abstract:Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at this https URL.

58. 【2605.19351】PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies

Link: https://arxiv.org/abs/2605.19351

Authors: Ahmad Yehia, Abduallah Mohamed, Kun Qian, Tianyi Wang, Jiseop Byeon, Omar Hassanin, Christian Claudel

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: remains poorly characterized, Generative agents based, large language models, language models reproduce, models reproduce believable

Comments: Preprint. 23 pages, 4 figures. Code and environment will be released upon publication

Abstract:Generative agents based on large language models reproduce believable human behavior in cooperative settings, but how they should reason in situations where rule-breaking may be required, such as fire evacuation or authority-supervised emergency, remains poorly characterized. We propose PAVE (Perception, Assessment, Verdict, Emulation), a novel four-module cognitive architecture that addresses this gap end to end: (i) Perception extracts a structured context with explicit authority distance, peer behaviors, and severity-tagged situational cues; (ii) Assessment scores the context along five scalars including an explicit legitimacy judgment that checks necessity, proportionality, and absence of alternatives; (iii) Verdict decides to comply or violate under a hard legitimacy gate, with a per-agent threshold elicited from the persona; (iv) Emulation enacts the verdict and scopes the violation to the rule the trigger justifies. We instantiate PAVE in Voville, a tile-based traffic environment forked from Smallville, and evaluate across three scenarios, four LLM backbones, and a focused ablation. PAVE agents satisfy four properties simultaneously: legitimate violation (only when a trigger justifies it), authority deference (officer instructions override even high legitimacy), bounded scope (violations confined to the targeted rule), and recovery (baseline restored once the trigger ends). PAVE agents make more structured and interpretable decisions than vanilla across all four properties, and human evaluators rate them as more plausible. Ablating the legitimacy gate reproduces vanilla-like failures. We release Voville, the PAVE prompts and code, and the evaluation pipeline.

59. 【2605.19346】IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

Link: https://arxiv.org/abs/2605.19346

Authors: Joy Bose

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Domestic Violence Act, Indian court judgments, Karnataka High Court, IPC Section, court judgments covering

Comments: 8 pages, 2 figures, 5 tables. Dataset available at [this http URL](http://huggingface.co/datasets/joyboseroy/imljd) and Code at [this http URL](http://github.com/joyboseroy/imljd)

Abstract:We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at this https URL and this https URL.
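The headline figures in this abstract can be cross-checked with simple arithmetic. The snippet below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the IMLJD abstract.
sc_cases, khc_cases = 1474, 2139        # Supreme Court, Karnataka High Court
sc_quash_all = 57.6                     # SC quash rate, 2000 to 2024 (%)
sc_quash_matched = 59.3                 # SC quash rate, 2018 to 2024 (%)
khc_quash = 39.7                        # Karnataka HC quash rate (%)

total_cases = sc_cases + khc_cases
gap_all = round(sc_quash_all - khc_quash, 1)          # unadjusted gap
gap_matched = round(sc_quash_matched - khc_quash, 1)  # matched-period gap

print(total_cases, gap_all, gap_matched)
```

The case counts add up to the stated 3,613 judgments, and the gap grows from 17.9 to the reported 19.6 percentage points once the Supreme Court rate is restricted to the matched period.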

60. 【2605.19344】Retrieval-Augmented Linguistic Calibration

Link: https://arxiv.org/abs/2605.19344

Authors: Yi-Fan Yeh, Linwei Tao, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu

Categories: Computation and Language (cs.CL)

Keywords: expressions remains underexplored, confidence expressions remains, linguistic confidence expressions, offer an intuitive, remains underexplored

Abstract:Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration by up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

61. 【2605.19341】HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Link: https://arxiv.org/abs/2605.19341

Authors: Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: existing benchmarks operationalize, question answering, retrieval-augmented generation, inconsistently across summarization, agentic interaction

Comments: HalluWorld benchmark (code and data) at [this http URL](http://github.com/DegenAI-Labs/HalluWorld)

Abstract:Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

62. 【2605.19338】STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision

Link: https://arxiv.org/abs/2605.19338

Authors: Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang, Yinpeng Dong

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Frontier AI models, led to significant, significant improvements, improvements in mathematical, mathematical reasoning

Comments: 25 pages, 4 figures. Code: [this https URL](https://github.com/Julius-Woo/STAR-PolyaMath)

Abstract:Frontier AI models and multi-agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long-horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. In this paper, we introduce STAR-PólyaMath, a multi-agent framework that systematically addresses these challenges through meta-level supervision and structured Reasoner-Verifier interaction. STAR-PólyaMath is structured as an orchestrated state machine with nested challenge-step-replan loops, governed by a reasoning-free Python orchestrator that separates control from inference and bounds error propagation through trace-back and re-planning. Our key innovation is a persistent Meta-Strategist that maintains cross-attempt memory and exercises meta-level control by issuing high-level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over-rely on tools. STAR-PólyaMath achieves state-of-the-art results on all eight top-tier competition benchmarks: AIME 2025-2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT-5.5. Ablation studies show that the gains arise from the framework's orchestration rather than from model-level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at this https URL.

63. 【2605.19316】A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

Link: https://arxiv.org/abs/2605.19316

Authors: Seonjeong Hwang, Jun Seo, Hyounghun Kim, Gary Geunbae Lee

Categories: Computation and Language (cs.CL)

Keywords: large language models, difficulty-controlled reading comprehension, leveraged large language, Recent studies, adjusting difficulty-related features

Comments: ACL 2026 Main Conference

Abstract:Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

64. 【2605.19309】How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

Link: https://arxiv.org/abs/2605.19309

Authors: Yue Chen, Yihao Wang, Ziyi Tang, Keze Wang

Categories: Computation and Language (cs.CL)

Keywords: Document Layout Analysis, document intelligence systems, long-document question answering, pipelines provide structured, remains largely area-centric

Comments: 19 pages, preprint

Abstract:Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this Footprint Bias and propose a lightweight output-level auditing framework that decouples probe construction, policy-driven targeting, and structure-aware diagnosis. The framework combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where perturbations interact with layout structure and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability ($R^2 = 0.384/0.110$), whereas B-SLR aligns much more closely with it ($R^2 = 0.727/0.916$). Exposure descriptors further separate occlusion- and topology-dominant pathways, and small structurally targeted probes cause downstream QA/retrieval degradation comparable to larger-footprint perturbations. These results shift DLA robustness evaluation from footprint-based stress testing toward structure-aware vulnerability auditing.

65. 【2605.19285】Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

Link: https://arxiv.org/abs/2605.19285

Authors: Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: social media platforms, formidable challenge, Misinformation Detection, rapid spread, social media

Comments: Accepted by KDD 2026. 12 pages, 8 figures. Code: [this https URL](https://github.com/wangbing1416/LONSREX)

Abstract:The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.

66. 【2605.19284】Language models struggle with compartmentalization

Link: https://arxiv.org/abs/2605.19284

Authors: Thomas Vincent Howe, David Wingate

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: English and Swahili, Python and Haskell, presented in multiple, express propositions, formal and natural

Comments: 9 pages, 8 figures, plus 9 pages of appendices. Submitted to NeurIPS 2026. Code: [this https URL](https://github.com/vinhowe/compartmentalization). Eval data: [this https URL](https://doi.org/10.5281/zenodo.20171021)

Abstract:In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.

67. 【2605.19276】OpenCompass: A Universal Evaluation Platform for Large Language Models

Link: https://arxiv.org/abs/2605.19276

Authors: Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: task-specific small-scale models, recent years, large language models, field of artificial, artificial intelligence

Abstract:In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

68. 【2605.19274】Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

Link: https://arxiv.org/abs/2605.19274

Authors: Somnath Banerjee, Pranav Jha, Rima Hazra, Animesh Mukherjee

Categories: Computation and Language (cs.CL)

Keywords: LLMs deployed multilingually, deployed multilingually, LLMs deployed, English, English explanations

Abstract:LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.

69. 【2605.19270】DECOR: Auditing LLM Deception via Information Manipulation Theory

Link: https://arxiv.org/abs/2605.19270

Authors: Linyue Cai, Samuel Yeh, Jwala Dhamala, Rahul Gupta, Sharon Li

Categories: Computation and Language (cs.CL)

Keywords: Large language models, subtly manipulating truthful, Large language, manipulating truthful information, shifting focus

Abstract:Large language models can deceive by subtly manipulating truthful information -- omitting key facts, shifting focus, or obscuring meaning -- making such behavior difficult to detect. Existing black-box methods rely on coarse-grained judgments, offering limited interpretability and failing to pinpoint which facts were distorted and how. We introduce DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses. DECOR decomposes input contexts into atomic informational units and scores each unit against the response across four dimensions of manipulation, producing interpretable manipulation profiles that are aggregated into a global deception index. We comprehensively evaluate DECOR on both single-turn and multi-turn deception detection benchmarks spanning real-world domains, and show that DECOR achieves state-of-the-art performance on both, outperforming competitive baselines. The framework generalizes across 15 frontier models, and ablation studies confirm the contribution of each key design component. Our findings demonstrate that fine-grained, theory-grounded auditing of information manipulation offers an effective and interpretable path for LLM deception detection.

70. 【2605.19266】FormalASR: End-to-End Spoken Chinese to Formal Text

Link: https://arxiv.org/abs/2605.19266

Authors: Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automatic speech recognition, downstream writing-oriented applications, informal spoken structures, Automatic speech, filler words

Abstract:Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.

71. 【2605.19234】AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

Link: https://arxiv.org/abs/2605.19234

Authors: Miguel A. Jiménez-Crespo, Stephanie Rodriguez, Alejandro Jaume Losa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reshaping translation practices, language access managers, language access, rapid emergence, technologies is reshaping

Comments: 11 pages, 2 tables, Convergence Conference 2026

Abstract:The rapid emergence of AI technologies is reshaping translation practices and theory across the board. This paper deals with the impact of AI in language access. This area is characterized by the need to serve broad and diverse user populations, within a context where efficiency and access are shaped by legal mandates, ethical and commercial tensions, and safety concerns. This paper reports on the attitudes and perceptions of language access managers towards AI and towards human value in the AI age. Methodologically, this paper presents an analysis of a subset of a broader study on language access and technology, specifically a qualitative thematic analysis of ten semi-structured interviews with language access managers in the USA working in healthcare, court, public service and local government contexts. The results indicate that language access managers show conditional optimism towards the inevitable AI implementations, are strongly risk-aware, and are deeply committed to the human value and human oversight of AI implementations and output.

72. 【2605.19228】Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

链接https://arxiv.org/abs/2605.19228

作者:Xiaoou Liu,Tiejin Chen,Dengjia Zhang,Yaqing Wang,Lu Cheng,Hua Wei

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, fail remains difficult, achieved strong performance, Language Models

备注: Accepted by ICML 2026

点击查看摘要

Abstract:Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

73. 【2605.19224】Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

链接https://arxiv.org/abs/2605.19224

作者:Aditya R. Vaidya,Richard J. Antonello,Alexander G. Huth

类目:Computation and Language (cs.CL)

关键词:Neuroscientists have recently, recently turned, turned to intracranial, human experiments, fine spatial

备注

点击查看摘要

Abstract:Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure's generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that "slow" data like fMRI can be a valuable resource for building better models of "fast" brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.

74. 【2605.19220】Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

链接https://arxiv.org/abs/2605.19220

作者:Tiejin Chen,Longchao Da,Xiaoou Liu,Hua Wei

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, deploying Large Language, Large Language, high-stakes domains, Uncertainty Quantification

备注: Accepted by ICML 2026 Position Paper Track

点击查看摘要

Abstract:Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
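
The central claim, that consistency-based UQ measures agreement among samples and not correctness, can be illustrated with a small sketch. It assumes a generic sample-and-cluster estimator in which exact string matching stands in for semantic clustering. The function name and the sampled answers are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Iterable


def cluster_entropy(answers: Iterable[str]) -> float:
    """Entropy (in nats) of the empirical distribution over answer clusters.

    Clustering is exact match after lowercasing and stripping, a crude
    stand-in for the semantic-equivalence clustering used in practice.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    total = sum(clusters.values())
    return sum((c / total) * math.log(total / c) for c in clusters.values())


# Hypothetical samples for the question "What is the capital of Australia?"
stable_but_wrong = ["Sydney", "sydney", "Sydney ", "Sydney", "Sydney"]
unstable = ["Canberra", "Sydney", "Melbourne", "Canberra", "Perth"]

low_uncertainty = cluster_entropy(stable_but_wrong)
high_uncertainty = cluster_entropy(unstable)
```

The consistently wrong set collapses into a single cluster and scores zero uncertainty, while the set containing the correct answer scores higher. Nothing in the estimator consults ground truth, which is the blind spot the abstract describes.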

75. 【2605.19196】Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

链接https://arxiv.org/abs/2605.19196

作者:Leyao Wang,Yanan He,Peng Chen,Asaf Yehudai,Yixin Liu,Rex Ying,Michal Shmueli-Scheuer,Arman Cohan

类目:Computation and Language (cs.CL)

关键词:producing evidence-grounded reports, increasingly automate complex, automate complex information-seeking, Deep research agents, complex information-seeking tasks

备注

点击查看摘要

Abstract:Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

76. 【2605.19194】MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

链接https://arxiv.org/abs/2605.19194

作者:Rui Chu

类目:Computation and Language (cs.CL)

关键词:large language model, improving large language, framework has shown, language model, performance by aggregating

备注

点击查看摘要

Abstract:The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

77. 【2605.19173】Prompting language influences diagnostic reasoning and accuracy of large language models

链接https://arxiv.org/abs/2605.19173

作者:Adrien Bazoge,Josselin Corvellec,Sofiane Djillali Sid-Ahmed,Pierre-Antoine Gourraud

类目:Computation and Language (cs.CL)

关键词:clinical decision support, Large language models, decision support, leaving their reliability, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

78. 【2605.19149】Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

链接https://arxiv.org/abs/2605.19149

作者:Rishi Jha,Harold Triedman,Arkaprabha Bhattacharya,Vitaly Shmatikov

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:computer and Web, Web use inevitably, inaccessible webpages, missing files, operating with computer

备注: 32 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call "accidental meltdown": unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.

79. 【2605.19141】GRASP: Deterministic argument ranking in interaction graphs

链接https://arxiv.org/abs/2605.19141

作者:Diganta Misra,Antonio Orvieto,Rediet Abebe,Volkan Cevher

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:Large language models, Large language, increasingly deployed, deployed as automated, automated judges

备注: Preprint

点击查看摘要

Abstract:Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack--defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
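
To make "a convergent attack-defense propagation operator" concrete, the sketch below runs a classic gradual semantics, the h-categoriser, on a tiny attack graph. This is a swapped-in textbook technique chosen for illustration and is not GRASP's operator, which the abstract does not specify. All names and the example graph are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def h_categoriser(
    arguments: Iterable[str],
    attacks: Iterable[Tuple[str, str]],
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> Dict[str, float]:
    """Iterate score(a) = 1 / (1 + sum of scores of a's attackers) to a fixed point.

    `attacks` holds (attacker, target) pairs. The update is deterministic,
    so the same graph always yields the same scores.
    """
    args: List[str] = list(arguments)
    attackers: Dict[str, List[str]] = {a: [] for a in args}
    for src, dst in attacks:
        attackers[dst].append(src)

    scores = {a: 1.0 for a in args}
    for _ in range(max_iter):
        new = {a: 1.0 / (1.0 + sum(scores[b] for b in attackers[a])) for a in args}
        if max(abs(new[a] - scores[a]) for a in args) < tol:
            return new
        scores = new
    return scores


# Hypothetical debate: B attacks A, and C attacks B (so C defends A).
scores = h_categoriser(["A", "B", "C"], [("B", "A"), ("C", "B")])
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this example the unattacked argument C ranks first, and A outranks B because its only attacker is itself weakened. That is one simple sense in which a ranking can be defense-aware over an explicit interaction graph.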

80. 【2605.19130】EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

链接https://arxiv.org/abs/2605.19130

作者:Dongyan Lin,Phillip Rust,Angel Villar Corrales,Alvin W. M. Tan,Mahi Luthra,Charles-Éric Saint-James,Rashel Moritz,Sheila Krogh-Jespersen,Vanessa Stark,Surya Parimi,Jiayi Shen,Youssef Benchekroun,Yosuke Higuchi,Martin Gleize,Tom Fizycki,Nicolas Hamilakis,Manel Khentout,Sho Tsuji,Balázs Kégl,Juan Pino,Michael C. Frank,Emmanuel Dupoux

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Children acquire language, Children acquire, limited visuo-linguistic input, acquire language grounding, large multimodal models

备注

点击查看摘要

Abstract:Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

81. 【2605.19099】DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

链接https://arxiv.org/abs/2605.19099

作者:Yuxuan Gao,Megan Wang,Yi Ling Yu,Zijian Carl Ma,Ao Qu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:long-horizon agentic workflows, introduce DecisionBench, agentic workflows, long-horizon agentic, benchmark substrate

备注: 28 pages, 9 figures, 11 tables. Code and data: [this https URL](https://huggingface.co/decisionbench)

点击查看摘要

Abstract:We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| = 0.010, p = 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.

82. 【2605.19092】Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

链接https://arxiv.org/abs/2605.19092

作者:Alexander Boesgaard Lorup(Openhagen)

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Reasoning systems increasingly, increasingly separate intermediate, separate intermediate computation, systems increasingly separate, creating evaluation cases

备注: 12 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.

83. 【2605.19077】ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

链接https://arxiv.org/abs/2605.19077

作者:Yanjun Lin,Zimo Xiao,Kartik Natarajan,Mahesh Sankaranarayanan,Niraj Nawanit,Rakshit Parashar,Austin Zhang,Karthik Konaraddi,Rishita Mote,Wei Niu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:require predictable behavior, Task-oriented dialogue systems, moderately-sized LLMs needed, handling transactions, service requests

备注: Accepted at TrustNLP Workshop at ACL 2026

点击查看摘要

Abstract:Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.
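
The control flow described here, a bounded loop in which a deterministic validator either accepts a proposed state update or returns errors for the next attempt, can be sketched as follows. This is a minimal illustration under stated assumptions: the schema, the `validate` rules, and the stub `fake_model` are all hypothetical, and the paper's validator also checks action compliance and coreference consistency, which are omitted.

```python
from typing import Callable, Dict, List, Set, Tuple

Update = Dict[str, object]

# Hypothetical schema: the slots each domain is allowed to carry.
SCHEMA: Dict[str, Set[str]] = {"hotel": {"name", "day", "people"}}


def validate(update: Update, schema: Dict[str, Set[str]]) -> List[str]:
    """Return error messages; an empty list means the update is accepted."""
    domain = update.get("domain")
    if domain not in schema:
        return [f"unknown domain: {domain!r}"]
    slots = update.get("slots", {})
    return [f"unknown slot for {domain}: {s!r}" for s in slots if s not in schema[domain]]


def bounded_update(
    propose: Callable[[List[str]], Update],
    state: Dict[str, str],
    schema: Dict[str, Set[str]],
    max_steps: int = 3,
) -> Tuple[Dict[str, str], bool]:
    """Ask `propose` for an update at most `max_steps` times, feeding back errors."""
    feedback: List[str] = []
    for _ in range(max_steps):
        update = propose(feedback)
        feedback = validate(update, schema)
        if not feedback:
            merged = dict(state)
            merged.update(update["slots"])
            return merged, True
    return dict(state), False  # budget exhausted: keep the previous state


def fake_model(feedback: List[str]) -> Update:
    """Stub standing in for an LLM: wrong slot name first, corrected on feedback."""
    slot = "day" if feedback else "date"
    return {"domain": "hotel", "slots": {slot: "friday"}}


new_state, accepted = bounded_update(fake_model, {}, SCHEMA)
```

Because the loop is capped and the previous state is returned on failure, an invalid proposal never reaches the dialogue state. This matches the predictability argument in the abstract, although the real system's recovery policy may differ.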

84. 【2605.19069】Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

链接https://arxiv.org/abs/2605.19069

作者:Sajjad Abdoli,Ghassan Al-Sumaidaee,Clayton W. Taylor,Ahmad(MAD)ElShiekh,Ahmed Rashad

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:automatic speech recognition, Word Error Rate, single Word Error, English, Existing commercial ASR

备注

点击查看摘要

Abstract:Code-switching -- the natural alternation between two languages within a single utterance -- represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR benchmarks predominantly evaluate clean, monolingual audio and report a single Word Error Rate (WER) figure that tells practitioners little about real-world multilingual performance. We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic--English, Saudi Arabic (Najdi/Hijazi)--English, Persian (Farsi)--English, and German--English. Each dataset comprises 300 samples selected by a two-stage pipeline: a heuristic filter scoring transcripts on five structural code-switching signals, followed by a GPT-4o and Gemini 1.5 Pro ensemble scoring candidates across six linguistic dimensions. This pipeline reduces LLM scoring costs by approximately 91% relative to exhaustive scoring. We evaluate the systems on both WER and BERTScore, arguing that BERTScore is a more reliable metric for Arabic and Persian pairs where transliteration variance causes WER to penalise semantically correct transcriptions. ElevenLabs Scribe v2 achieves the lowest WER across all four language pairs (13.2% overall; 13.1% on Egyptian Arabic) and leads on BERTScore (0.936 overall). We further demonstrate that difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and that BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The benchmarking dataset is publicly available at this https URL.

85. 【2605.19066】The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

链接https://arxiv.org/abs/2605.19066

作者:Vukosi Marivate

类目:Computation and Language (cs.CL)

关键词:experienced explosive growth, massively multilingual models, natural language processing, past decade, explosive growth

备注: Under Review

点击查看摘要

Abstract:Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014--present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the "Annotation Scarcity Paradox", the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses -- including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning -- and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

86. 【2605.19008】Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

链接https://arxiv.org/abs/2605.19008

作者:Anis Radianis

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Modern language-model training, Modern language-model, degraded runs, exposed to instability, runtime-stress conditions

备注

点击查看摘要

Abstract:Modern language-model training is increasingly exposed to instability, degraded runs, and wasted compute, especially under aggressive learning-rate, scale, and runtime-stress conditions. This paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW. Rather than replacing the optimizer update rule, LBW-Guard observes training telemetry, interprets instability-sensitive regimes, and applies bounded control to optimizer execution while preserving fixed training objectives. We evaluate LBW-Guard in a Qwen2.5-centered stress-and-robustness suite using WikiText-103, with Qwen2.5-7B as the empirical anchor, model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, learning-rate stress tests, gradient-clipping baselines, and a no-LoRA TinyLlama-1B full-parameter sanity check. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7% improvement, while reducing end-to-end time from 392.54s to 357.02s, a 1.10x speedup. Under stronger learning-rate stress, AdamW degrades to 1885.24 final perplexity at LR=3e-3 and 659.76 at LR=1e-3, whereas LBW-Guard remains trainable at 11.57 and 10.33, respectively. Gradient-clipping baselines do not reproduce this effect. These results support a scoped systems conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer. LBW-Guard provides evidence that bounded runtime control can preserve productive compute under stress while remaining distinct from optimizer replacement and local gradient suppression.
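
The headline percentages in the 7B reference setting follow directly from the raw figures quoted in the abstract. The short check below recomputes them. The helper names are hypothetical, and only numbers stated above are used.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric such as perplexity."""
    return (before - after) / before


def speedup(time_before: float, time_after: float) -> float:
    """Ratio of wall-clock times; values above 1 mean the second run is faster."""
    return time_before / time_after


# Figures from the abstract: final perplexity 13.21 -> 10.74,
# end-to-end time 392.54s -> 357.02s.
ppl_gain = relative_improvement(13.21, 10.74)
time_gain = speedup(392.54, 357.02)

print(f"perplexity improvement: {ppl_gain:.1%}")
print(f"speedup: {time_gain:.2f}x")
```

Both results round to the values reported in the abstract (18.7% and 1.10x), so the two summary figures are consistent with the raw measurements.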

87. 【2605.18936】FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data

链接https://arxiv.org/abs/2605.18936

作者:Nuredin Ali Abdelkadir,Anjali Ratnam,Zeerak Talat,Stevie Chancellor

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Social media text, train Machine Learning, Social media, train Machine, Machine Learning

备注: Association for Computational Linguistics (ACL) 2026 Main Conference

点击查看摘要

Abstract:Social media text data are often used to train Machine Learning (ML) models to identify users exhibiting high-risk mental health behaviors. However, sharing this sensitive data poses privacy risks and limits the growth of benchmark datasets. We comprehensively evaluate whether privacy-preserving ML techniques can enable safer data sharing while preserving performance. Specifically, we apply federated learning (FL) and Differentially Private FL for two widely-studied mental health prediction tasks: depression detection on X (Twitter) and suicide crisis detection on Reddit. We simulate realistic data-sharing scenarios by treating each user as a client in a non-IID setting, evaluating across different client fractions, aggregation strategies, and privacy budgets. While FL achieves comparable performance to centralized training (centralized F1 = 85.63; best FL model F1 = 83.16) on depression identification, we find that Differentially Private FL has a large performance-privacy trade-off (up to F1 = 27.01 drop) even with low levels of noise (epsilon = 50). This is due to the distortion of highly informative yet sparse linguistic markers related to mental health, like health topics and emotion words. This research empirically demonstrates the potential and limitations of current privacy preservation techniques for mental health inference tasks.
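
For readers unfamiliar with the setup, the sketch below shows one round of federated averaging with the clip-then-add-Gaussian-noise step commonly used for differentially private FL. It is a generic illustration and not the paper's training code. The function names and the `noise_multiplier` parameter are assumptions, and the mapping from a privacy budget (epsilon) to a noise level is deliberately left out.

```python
import random
from typing import List, Optional, Sequence


def clip(update: Sequence[float], max_norm: float) -> List[float]:
    """Scale an update so that its L2 norm is at most max_norm."""
    norm = sum(x * x for x in update) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def fed_avg_round(
    global_w: Sequence[float],
    client_updates: Sequence[Sequence[float]],
    max_norm: float = 1.0,
    noise_multiplier: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """One round: clip each client's update, sum, add noise, average, apply."""
    rng = rng or random.Random(0)
    clipped = [clip(u, max_norm) for u in client_updates]
    n = len(clipped)
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    new_w = []
    for i, w in enumerate(global_w):
        noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        total = sum(u[i] for u in clipped) + noise
        new_w.append(w + total / n)
    return new_w


# Hypothetical two-client round, each client standing for one user.
plain = fed_avg_round([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
private = fed_avg_round([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], noise_multiplier=1.0)
```

Clipping bounds how much any single user can move the model, and the added noise hides individual contributions. When the useful signal is concentrated in a few sparse features, the same noise can swamp it, which is consistent with the performance drop the abstract reports.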

88. 【2605.18904】Dynamic Model Merging Made Slim

Link: https://arxiv.org/abs/2605.18904

Authors: Guodong Du,Wanyu Lin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: original data, Model merging enables, enables the reuse, joint training, training or access

Comments

Click to view abstract

Abstract:Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy--efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.

89. 【2605.18879】ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Link: https://arxiv.org/abs/2605.18879

Authors: Yujie Lin,Chengyi Yang,Zhishang Xiang,Yiping Song,Jinsong Su

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, induce harmful generations, massive web corpora, Large language, retain sensitive information

Comments

Click to view abstract

Abstract:Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: this https URL.

90. 【2605.18864】SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Link: https://arxiv.org/abs/2605.18864

Authors: Chanuk Lee,Minki Kang,Sung Ju Hwang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent studies observe, large language models, yield comparable gains, reliably improves pass, RLVR genuinely enables

Comments: Preprint

Click to view abstract

Abstract:Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at this https URL.

91. 【2605.18856】SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

Link: https://arxiv.org/abs/2605.18856

Authors: Anay Chauhan,Gurucharan Marthi Krishna Kumar,Arion Das,Amit Dhanda,Vinija Jain,Aman Chadha,Amitava Das

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)

Keywords: High Bandwidth Memory, repeated High Bandwidth, resident memory grows, High Bandwidth, Bandwidth Memory

Comments

Click to view abstract

Abstract:Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

ACM classes: I.2.6; I.2.7

92. 【2605.18852】Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

Link: https://arxiv.org/abs/2605.18852

Authors: Qinwu Xu,Zhuoheng Li,Jessie Salas

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: presents significant challenges, large language models, multimodal large language, language models, presents significant

Comments

Click to view abstract

Abstract:Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.

93. 【2605.18840】The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Link: https://arxiv.org/abs/2605.18840

Authors: Adil Amin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Leaderboards rank frontier, Leaderboards rank, GPQA Diamond scores, rank frontier models, independent axes

Comments: 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." Code: [this https URL](https://github.com/adilamin89/cape-scaling) . Dashboard: [this https URL](https://zehenlabs.com/cape/)

Click to view abstract

Abstract:Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024--2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first ($h$: $+11.2 \to -4.7$, 15.9-pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static -- it cascades. Six open-weight architectures confirm a second capability transition at 30--72B, and SWE-bench is now saturating while HLE and instruction-following retain discriminatory spread -- signaling the next axis rotation. We provide a three-level playbook (locate, diagnose, rotate), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per-lab coupling slopes vary $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample ($r$ rises from $+0.72$ to $+0.75$). An interactive dashboard provides phase classification with actionable recommendations, $h$-field diagnostics, per-lab coupling trajectories, ODE-based scaling predictions, benchmark rotation guidance, self-steering demo, and live tracking of all seven predictions: this https URL.

94. 【2605.18838】Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Link: https://arxiv.org/abs/2605.18838

Authors: Adil Amin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: laws predict loss, Scaling laws predict, laws predict, capabilities interact, predict loss

Comments: 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." Code: [this https URL](https://github.com/adilamin89/cape-scaling) . Dashboard: [this https URL](https://zehenlabs.com/cape/)

Click to view abstract

Abstract:Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale $N_c$, capabilities anticorrelate; above it, they cooperate. $N_c \approx 3.5$B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift $N_c$ independently: curated training eliminated the coupling dip between Qwen generations ($0.025 \to 0.830$ at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier ($r = +0.72$, 34 models, 10 labs). Code, data, and an open-source activation-steering tool for any open-weight model are released alongside an interactive dashboard that diagnoses any model's coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: this https URL.

95. 【2605.18824】Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Link: https://arxiv.org/abs/2605.18824

Authors: Mohammed Saidul Islam,Negin Baghbanzadeh,Farnaz Kohankhaki,Afshin Cheraghi,Ali Kore,Shayaan Mehdi,Elham Dolatabadi,Arash Afkanpour

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: lack comprehensive coverage, rely on aggregate, aggregate scores, lack comprehensive, fine-grained evaluation

Comments

Click to view abstract

Abstract:Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.

96. 【2605.18812】PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

Link: https://arxiv.org/abs/2605.18812

Authors: Varun Kotte

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Modern NLP, named entity recognition, retrieval-augmented generation, named entity, entity recognition

Comments

Click to view abstract

Abstract:Modern NLP and LLM systems are pipelines: named entity recognition (NER) - entity disambiguation (NED) - entity typing, retrieval-augmented generation (retriever - reader), and agentic chains of planner - tool - critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER - NED - entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
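
The reduction described in this abstract (multi-stage joint coverage turned into one scalar conformal problem on the joint maximum nonconformity score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `joint_max_threshold` and `prediction_sets`, the toy scores, and the assumption that per-stage nonconformity scores live on a comparable scale are all ours.

```python
import math

def joint_max_threshold(cal_scores, alpha):
    """Split-conformal quantile of the per-example maximum stage score.

    cal_scores: one list per calibration example, holding the K stage
    nonconformity scores of the true labels (assumed comparable in scale).
    """
    max_scores = sorted(max(stage_scores) for stage_scores in cal_scores)
    n = len(max_scores)
    # Standard finite-sample rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return max_scores[rank - 1]

def prediction_sets(stage_candidate_scores, threshold):
    """Keep, for every stage, each candidate whose score is within the threshold."""
    return [
        {label for label, score in candidates.items() if score <= threshold}
        for candidates in stage_candidate_scores
    ]

# Toy calibration set: three examples, two stages each.
calibration = [[0.1, 0.3], [0.5, 0.2], [0.4, 0.9]]
tau = joint_max_threshold(calibration, alpha=0.5)
sets = prediction_sets(
    [{"PER": 0.2, "ORG": 0.7}, {"Q42": 0.5, "Q7": 0.6}], tau
)
```

A single threshold is shared by all stages, which is why only one quantile computation is needed; a Bonferroni-style alternative would instead calibrate each stage separately at level alpha / K.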

97. 【2605.18808】Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Link: https://arxiv.org/abs/2605.18808

Authors: Joao Paulo Cavalcante Presa,Savio Salvarino Teles de Oliveira

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, instruction-tuned large language, mid-depth residual streams, language models, literary primitives

Comments: 36 pages, 6 figures

Click to view abstract

Abstract:We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.

98. 【2605.18799】ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Link: https://arxiv.org/abs/2605.18799

Authors: Wanghan Xu,Yuhao Zhou,Hengyuan Zhao,Shuo Li,Dianzhi Yu,Zhenfei Yin,Yaowen Hu,Fengli Xu,Wanli Ouyang,Wenlong Zhang,Lei Bai

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, initially correct scientific, correct scientific solution, Large language, answering incorrectly

Comments

Click to view abstract

Abstract:Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at this https URL .
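
The four-quadrant decomposition named in this abstract can be sketched as a small lookup from (initial correctness, post-critic correctness) to a quadrant and a reward. The quadrant names come from the abstract; the numeric weights in `QUADRANT_REWARD`, including the sign chosen for the boundary case, are purely hypothetical placeholders and not the paper's values.

```python
def transition_quadrant(initial_correct: bool, critic_correct: bool) -> str:
    """Map an Initial-to-Critic correctness transition to its quadrant."""
    if initial_correct and critic_correct:
        return "robustness"   # stayed correct under criticism
    if initial_correct and not critic_correct:
        return "sycophancy"   # abandoned a correct answer
    if not initial_correct and critic_correct:
        return "correction"   # fixed a wrong answer
    return "boundary"         # wrong before and after

# Hypothetical weights: reward correction and robustness, penalize
# sycophancy, and give persistent errors only a weak signal.
QUADRANT_REWARD = {
    "correction": 1.0,
    "robustness": 0.5,
    "sycophancy": -1.0,
    "boundary": -0.1,
}

def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Reward assigned to one two-turn rollout."""
    return QUADRANT_REWARD[transition_quadrant(initial_correct, critic_correct)]
```

A final-answer reward would score the robustness and correction quadrants identically and the sycophancy and boundary quadrants identically; keying the reward on the transition is what separates them.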

99. 【2605.18796】UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

Link: https://arxiv.org/abs/2605.18796

Authors: Varun Kotte

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: per-workload threshold tuning, uncalibrated confidence scores, promise lower inference, require per-workload threshold, sending easy queries

Comments: 9 pages, 2 figures, 4 tables. Code: [this https URL](https://github.com/varunkotte6/ucci)

Click to view abstract

Abstract:LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
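
A calibration-first router of the kind this abstract describes can be sketched in two steps: fit a monotone map from an uncertainty score to an error probability, then pick an escalation threshold under an error constraint. The sketch below uses the pool-adjacent-violators algorithm for the isotonic fit. The threshold rule (keep as many queries on the small model as the error budget allows) is our simplified stand-in for the paper's constrained cost minimization, and all function names are hypothetical.

```python
def isotonic_fit(xs, ys):
    """Pool-adjacent-violators: non-decreasing step fit of ys against xs.

    Returns a list of (upper_x, value) blocks sorted by upper_x.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    blocks = []  # each block is [sum_y, count, upper_x]
    for i in order:
        blocks.append([ys[i], 1, xs[i]])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c, x = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] = x
    return [(b[2], b[0] / b[1]) for b in blocks]

def calibrated_prob(blocks, x):
    """Step-function lookup of the calibrated error probability at x."""
    for upper_x, value in blocks:
        if x <= upper_x:
            return value
    return blocks[-1][1]

def pick_threshold(cal_probs, max_error):
    """Largest t such that queries with p <= t have mean predicted error <= max_error.

    Returns None when even the most confident group violates the budget,
    meaning every query would be escalated.
    """
    best = None
    for t in sorted(set(cal_probs)):
        kept = [p for p in cal_probs if p <= t]
        if sum(kept) / len(kept) <= max_error:
            best = t
        else:
            break
    return best

# Toy data: uncertainty scores and whether the small model was wrong (1) or right (0).
blocks = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
threshold = pick_threshold([0.0, 0.5, 0.5, 1.0], max_error=0.34)
```

Queries whose calibrated error probability exceeds `threshold` would be escalated to the large model; the rest stay on the small one.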

100. 【2605.18792】Trust or Abstain? A Self-Aware RAG Approach

Link: https://arxiv.org/abs/2605.18792

Authors: Xi Zhu,Ziqi Wang,Kai Mei,Wujiang Xu,Minghao Guo,Bangji Yang,Jiajun Fan,Dimitris N. Metaxas

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: large language models, incorporating external evidence, Retrieval-augmented generation, retrieved contextual knowledge, improves large language

Comments

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self-awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model-specific, ground-truth-aligned knowledge-conflict benchmark by evaluating LLM backbones on PK-only and CK-conditioned answer paths over approximately 69K query-context instances per backbone, drawn from five conflict-QA datasets. We then introduce SABER, a Self-Aware Belief Estimator for RAG that requires no LLM fine-tuning. SABER combines a self-prior with PK-side and CK-side conditional reasoning representations from multi-trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4-cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end-to-end accuracy and conflict-specific faithfulness over ten inference-time and fine-tuning baselines, with the largest gains on conflict-heavy datasets. Under abstention, SABER's risk-coverage curve Pareto-dominates every prompt-based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at this https URL.
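
The 4-cell decision mentioned in this abstract can be sketched as a rule over two reliability beliefs, one for the parametric-knowledge (PK) answer path and one for the contextual-knowledge (CK) path. How SABER actually estimates these beliefs and sets its cut-off is not given here; the function name `four_cell_decision` and the single shared threshold `tau` are assumptions made for illustration.

```python
def four_cell_decision(belief_pk: float, belief_ck: float, tau: float = 0.5) -> str:
    """Choose among trust PK, trust CK, trust either, or abstain.

    belief_pk / belief_ck: estimated probability that the PK-only and
    CK-conditioned answers are correct. tau is a hypothetical cut-off;
    raising it trades coverage for lower answer risk.
    """
    pk_ok = belief_pk >= tau
    ck_ok = belief_ck >= tau
    if pk_ok and ck_ok:
        return "trust_either"
    if pk_ok:
        return "trust_pk"
    if ck_ok:
        return "trust_ck"
    return "abstain"
```

Sweeping `tau` from 0 to 1 traces a risk-coverage curve: at low values almost nothing is abstained on, while at high values only answers with strong reliability beliefs are returned.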

101. 【2605.18772】Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization

Link: https://arxiv.org/abs/2605.18772

Authors: Gongbo Zhang,Yifan Peng,Chunhua Weng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language model, Retrieval-Augmented Generation, grounding generation, external knowledge, factual accuracy

Comments

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.

102. 【2605.18769】ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.18769

Authors: Gibson Nkhata,Uttamasha Anjally Oyshi,Quan Mai,Susan Gauch

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Personalized Retrieval-Augmented Generation, accurately selecting user-relevant, selecting user-relevant documents, enhance personalized generation, Personalized Retrieval-Augmented

Comments: 17 pages, 2 figures, to be published in the proceedings of ACL 2026

Click to view abstract

Abstract:Personalized Retrieval-Augmented Generation (RAG) relies on accurately selecting user-relevant documents. In practice, existing RAG approaches often suffer from high retrieval costs and overlook that collaborative signals from similar users can enhance personalized generation for the current user. We propose ClusterRAG, a Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation. ClusterRAG represents users through their profile documents, organizes users into semantically coherent clusters using density-based clustering, and performs retrieval at both the cluster and document levels via cluster-level similarity and fine-grained ranking. Extensive experiments on the LaMP benchmark demonstrate that jointly leveraging the target user's profile and profiles from top similar users consistently yields the best performance across diverse tasks. Further analysis shows that ClusterRAG integrates seamlessly with different dense retrievers and rankers, and remains effective when paired with both fine-tuned and zero-shot language models.

103. 【2605.18766】Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

Link: https://arxiv.org/abs/2605.18766

Authors: Taehee Kim,Seungbin Yang,Jihwan Kim,Jaegul Choo

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieving relevant tables, accurately answering questions, Retrieving relevant, natural language query, natural language

Comments: ACL 2026 Findings

Click to view abstract

Abstract:Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at this https URL.

Information Retrieval

1. 【2605.20157】SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

Link: https://arxiv.org/abs/2605.20157

Authors: Sudheer Tubati,Amit Goyal

Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: Music streaming fraud, bad actors artificially, actors artificially inflate, artificially inflate stream, inflate stream counts

Comments

Click to view abstract

Abstract:Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.

2. 【2605.20123】BiRD: A Bidirectional Ranking Defense Mechanism for Retrieval Augmented Generation

Link: https://arxiv.org/abs/2605.20123

Authors: Chengcai Gao,Zhihong Sun,Xiaochuan Shi,Qiufeng Wang,Chao Liang

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: Retrieval-Augmented Generation, growing adoption, adoption of Retrieval-Augmented, rise in adversarial, RAG

Comments: 17 pages, 10 figures and 8 tables

Click to view abstract

Abstract:The growing adoption of Retrieval-Augmented Generation (RAG) has led to a rise in adversarial attacks. Existing defenses, relying on semantic analysis or voting, face a trade-off between high computational cost and limited robustness under strong poisoning attacks. Their fundamental limitation is the exclusive focus on semantic content relevance, while neglecting the retrieval context that is critically defined by ranking structures. To this end, we investigate the bidirectional ranking behavior of poisoned and benign documents, and discover a key discriminative pattern: poisoned documents exhibit significantly stronger alignment between their backward rankings and the query's forward ranking. Capitalizing on this, we propose BiRD, a bidirectional ranking defense mechanism built upon a dual-signal framework that leverages forward ranking to assess semantic content relevance and backward ranking to quantify ranking context consistency. This design directly addresses the fundamental limitation of prior approaches, enabling simultaneous efficiency and robustness. Extensive evaluation across 3 datasets with 3 retrievers and 3 LLMs under 2 attack scenarios validates BiRD's effectiveness. Notably, BiRD reduces the attack success rate of PoisonedRAG by up to 54% while simultaneously improving task accuracy by up to 56%, with average additional latency under 1 second.

3. 【2605.19847】Auditing Privacy in Multi-Tenant RAG under Account Collusion

Link: https://arxiv.org/abs/2605.19847

Authors: Florian A. D. Burnat, Brittany I. Davidson

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Multi-tenant retrieval-augmented generation, Multi-tenant retrieval-augmented, operative leakage boundary, services advertise per-account, advertise per-account differential

Abstract:Multi-tenant retrieval-augmented generation (RAG) services advertise per-account differential privacy as the operative leakage boundary: each account's queries are guaranteed to satisfy $(\varepsilon_{\text{acc}}, \delta_{\text{acc}})$-DP with respect to the index. We identify same-index multi-account collusion as a privacy-boundary failure: for $k$ same-tenant accounts coordinating against the tenant's index -- the operative regime -- known DP composition theory implies joint leakage degrades unconditionally at rate $\Theta(\sqrt{k} \cdot \varepsilon_{\text{acc}})$ for Gaussian-noised retrieval. Cross-tenant and external collusion match the rate only under explicit access-control failure (M4); without M4 these regimes have zero leakage by design and reduce to an architectural audit, not a DP audit. We exhibit an attack realizing the rate and derive a RAG-specific MIA prediction we test empirically. To make this per-account/joint gap auditable, we design the first audit protocol that operates against unmodified RAG deployments and issues a quantitative $(\textsf{PASS}, \varepsilon_{\text{audit}})$ verdict for the retrieval-score channel -- the noise-then-select step the per-account DP guarantee actually covers -- without index disclosure, pipeline redesign, or model-weight exposure. Generation-channel privacy (LLM output conditioned on selected documents) is a separate audit predicate that should compose with ours; we explicitly scope it out. The protocol composes generic cryptographic primitives (Merkle ledgers, ZK function-application proofs, Gaussian noise attestations) with six RAG-specific primitives (embedder commitment, index-content vector commitment, per-account query ledger, noise-then-select attestation, cross-tenant containment proof, coalition-size estimator) and supports both closed-form audit bounds and Rényi-DP moments-accountant tracking.
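One standard way to see where a $\sqrt{k}$ rate can come from is composition under Gaussian differential privacy (GDP). The lines below are a sketch under assumed notation (score sensitivity $\Delta$, noise scale $\sigma$, independent Gaussian noise for each of the $k$ accounts); they are not claimed to be the paper's own derivation.

$$
\mu_{\text{acc}} = \frac{\Delta}{\sigma},
\qquad
\mu_{\text{joint}} = \sqrt{\sum_{i=1}^{k} \mu_{\text{acc}}^{2}} = \sqrt{k}\,\mu_{\text{acc}}
$$

Each account's Gaussian mechanism is $\mu_{\text{acc}}$-GDP, and $k$ such mechanisms compose to $\sqrt{k}\,\mu_{\text{acc}}$-GDP, so the joint privacy parameter grows with the square root of the coalition size. This matches the form of the $\Theta(\sqrt{k} \cdot \varepsilon_{\text{acc}})$ rate quoted in the abstract.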

4. 【2605.19651】Divergence Meets Consensus: A Multi-Source Negative Sampling Framework for Sequential Recommendation

Link: https://arxiv.org/abs/2605.19651

Authors: Yuanzi Li, Lingjie Wang, Jingyu Zhao, Zihang Tian, Yuhan Wang, Lei Wang, Xu Chen

Categories: Information Retrieval (cs.IR)

Keywords: training sequential recommendation, Negative sampling, sequential recommendation models, implicit feedback, significant for training

Abstract:Negative sampling is significant for training sequential recommendation models under implicit feedback. The predominant strategy, self-guided hard negative sampling, selects negatives based on the model's current state but suffers from three limitations: (1) the coupling between sampling and model updates triggers a vicious cycle that drives the model into local optima; (2) relying on current model parameters narrows sampling to a small region of the item space, reducing diversity and harming generalization; (3) identifying a hard negative requires scoring the entire candidate pool, causing substantial computational overhead with minimal information gain. To address these challenges, we propose MDCNS (Multi-source Divergence-Consensus for Negative Sampling), a novel "Teacher-Peer-Self" framework inspired by Vygotsky's Zone of Proximal Development (ZPD) theory. The proposed method comprises three components, including multi-source scoring, divergence re-ranking, and consensus distillation. Firstly, multi-source scoring incorporates peer and ensemble teacher models to inject external negative signals and break the self-reinforcement loop. Then, divergence re-ranking exploits prediction discrepancy between self and peer models to enhance sampling diversity. Finally, consensus distillation aligns the self model with the teacher via KL divergence, simultaneously improving computational cost utilization. Extensive experiments on six real-world datasets and five backbone models show that MDCNS consistently outperforms state-of-the-art negative sampling methods, demonstrating strong effectiveness and generalization.

5. 【2605.19628】Understanding Wacky Weights: A Dissection of SPLADE's Learned Term Importance

Link: https://arxiv.org/abs/2605.19628

Authors: Gregory Polyakov, Harrisen Scells, Carsten Eickhoff

Categories: Information Retrieval (cs.IR)

Keywords: Learned sparse retrieval, Learned sparse, inverted indices, neural architectures, efficiency of inverted

Comments: 11 pages, 4 figures, accepted at SIGIR 2026

Abstract:Learned sparse retrieval models such as SPLADE combine the effectiveness of neural architectures with the efficiency of inverted indices. As these models assign weights to terms from a fixed vocabulary, interpretability is often touted as a major benefit of these models. However, the emergence of wacky weights, i.e., expansion terms that appear semantically unrelated to the input, limits interpretability. While prior research has anecdotally observed this phenomenon, there is a lack of systematic understanding regarding their origins, prevalence, and contribution to retrieval effectiveness. In this paper, we reproduce SPLADE-v2 to systematically investigate wacky weights across the SPLADE family of models. We present a comprehensive dissection of wacky weights, providing a formal definition of wackiness based on the lexical utility of expansion terms. Furthermore, we introduce a novel measure to compare the prevalence of these tokens across models with varying vocabularies and sparsity levels. Beyond reproducing the original SPLADE-v2, we train it with various loss functions, datasets, and backbone transformers to isolate the factors contributing to wackiness. Our results show that larger vocabularies are associated with a higher prevalence of wacky tokens, while stricter sparsity regularizers are associated with lower prevalence. Finally, we find that wacky weights are used primarily for in-domain effectiveness rather than out-of-domain generalization.

6. 【2605.18920】SynGR: Unleashing the Potential of Cross-Modal Synergy for Generative Recommendation

Link: https://arxiv.org/abs/2605.18920

Authors: Wei Chen, Xingyu Guo, Shuang Li, Fuwei Zhang, Meng Yuan, Jing Fan, Zhao Zhang, Deqing Wang, Fuzhen Zhuang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: promising paradigm, paradigm by formulating, formulating item recommendation, Generative Recommendation, generation task

Comments: Accepted by ICML 2026, 15 pages

Abstract:Generative Recommendation (GR) has emerged as a promising paradigm by formulating item recommendation as a sequence-to-sequence generation task over item identifiers. Recent studies have incorporated multimodal signals to provide richer token-level evidence for generation. However, existing approaches largely rely on alignment-centric fusion and underexplore synergistic information across modalities. In practice, synergistic information plays a critical role in capturing emergent item properties that cannot be inferred from any single modality alone. Such properties encode intrinsic item semantics and guide user preferences, enabling models to move beyond surface-level feature matching. To address this limitation, we propose SynGR, a synergistic generative recommendation framework that explicitly encourages the exploitation of cross-modal dependencies during generation. By constraining overreliance on dominant modalities, SynGR enables the model to capture emergent item semantics beyond shared or modality-specific signals. Extensive experiments across three benchmark datasets demonstrate that SynGR achieves superior performance.

7. 【2605.18857】The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection

Link: https://arxiv.org/abs/2605.18857

Authors: Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, Ameya Gawde

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: discard irrelevant information, discard irrelevant, search results, human consumers, retrieval

Comments: 12 pages, 2 figures, 7 tables. Accepted at ICLR 2026 Blog Track, https://iclr-blogposts.github.io/2026/blog/2026/bits-over-random/

Abstract:For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits-over-Random (BoR), a chance-corrected measure of retrieval selectivity that reveals when high success rates mask random-level performance. We measure selectivity as $BoR = \log_{2}\left(\frac{\mathrm{P}_{obs}}{\mathrm{P}_{rand}}\right)$, where $\mathrm{P}_{rand}$ is the hypergeometric baseline for the chosen success rule (here, coverage: $ \geq1 $ relevant in top-$K$). On the 20 Newsgroups dataset, BM25 and SPLADE both report $99$% success at $K=100$ (coverage), yet $BoR \approx 0$, indicating random-level selectivity at that depth. When the expected coverage ratio $\left(\frac{K \cdot \bar{R}_{q}}{N}\right)$ exceeds 3-5, the baseline dominates and selectivity collapses. Downstream retrieval-augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at $K=100$, consistent with the near-zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap), confirming baseline predictions across sparse and large-scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

MSC classes: 68P20, 68T50, 94A17

ACM classes: H.3.3; I.2.7; I.2.11; I.2.6

Journal reference: ICLR Blog Track 2026, https://iclr.cc/virtual/2026/poster/10012083
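
As a rough illustration of the BoR formula quoted in the abstract above, the sketch below computes the hypergeometric coverage baseline (at least one relevant document in a uniformly random top-$K$) and the resulting bits over random. The function names and the corpus numbers are hypothetical and are not taken from the paper.

```python
from math import comb, log2


def p_rand_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Chance that k documents drawn uniformly without replacement
    contain at least one of the n_relevant relevant documents."""
    if not 0 <= k <= n_docs or not 0 <= n_relevant <= n_docs:
        raise ValueError("need 0 <= k <= n_docs and 0 <= n_relevant <= n_docs")
    # Hypergeometric probability of drawing zero relevant documents.
    p_none = comb(n_docs - n_relevant, k) / comb(n_docs, k)
    return 1.0 - p_none


def bits_over_random(p_obs: float, p_rand: float) -> float:
    """BoR = log2(P_obs / P_rand); values near 0 mean chance-level selectivity."""
    return log2(p_obs / p_rand)


# Hypothetical dense setting: K * R / N = 100 * 50 / 1000 = 5.
dense_baseline = p_rand_coverage(n_docs=1000, n_relevant=50, k=100)
dense_bor = bits_over_random(p_obs=0.99, p_rand=dense_baseline)

# Hypothetical sparse setting: one relevant document, shallow depth.
sparse_baseline = p_rand_coverage(n_docs=1000, n_relevant=1, k=10)
sparse_bor = bits_over_random(p_obs=0.80, p_rand=sparse_baseline)

print(f"dense:  P_rand={dense_baseline:.4f}  BoR={dense_bor:.3f}")
print(f"sparse: P_rand={sparse_baseline:.4f}  BoR={sparse_bor:.3f}")
```

In the dense case the random baseline is already close to 1, so a 99% observed success rate yields a BoR near zero. In the sparse case the baseline is about 0.01, so the same formula reports several bits of selectivity.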

8. 【2605.18850】KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat

Link: https://arxiv.org/abs/2605.18850

Authors: Adrian Cierpka, Mohammad Shafiqul Islam, Johannes Steinhülb, Eric Dietriche Sesso Domtchoueng, Michael Selzer, Arnd Koeppe

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: data, privacy-sensitive research data, research data, access, Kadi research data

Abstract:We introduce KadiAssistant, a privacy-by-design AI assistant integrated into the Kadi research data ecosystem, enabling researchers to efficiently access, aggregate, and synthesize information from heterogeneous, privacy-sensitive research data. Interdisciplinary fields such as materials science bring together disciplines with their own terminology and standards. While this convergence fuels innovation, it also makes it increasingly difficult to connect and access knowledge, as data are distributed across disciplines, organizations, and individuals. For example, battery research combines electrochemical measurements, materials characterization data, physics-based simulations, and manufacturing parameters, each using different formats, vocabularies, and standards. Efficiently storing and sharing such heterogeneous data via research data platforms, such as Kadi4Mat, demands domain knowledge, technical expertise, and familiarity with metadata schemas and interfaces. Research data also vary in sensitivity: newly generated 'warm' data are often private, whereas published 'cold' data are usually openly accessible. The Kadi ecosystem offers fine-grained access control needed for sensitive data. A solution for efficient information retrieval in Kadi must therefore respect the fine-grained access permissions. To address these intertwined challenges of information retrieval, strong data privacy, and complex access control, KadiAssistant combines a self-hosted large language model (LLM) with a privacy-preserving semantic search, inspired by retrieval-augmented generation, that can access files and record metadata on Kadi. This allows the assistant to screen, aggregate, and structure information into a highly informative answer. KadiAssistant therefore bridges terminology and standards, lowers access barriers for researchers, and strengthens the Findable pillar of FAIR data principles.

9. 【2605.18827】Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Link: https://arxiv.org/abs/2605.18827

Authors: Prateek Biswas, Dhaval Patel, Vedant Khandelwal, Shuxin Lin, Amit Sheth

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Programming Languages (cs.PL)

Keywords: evaluate small language, deployed language-model systems, language-model systems increasingly, systems increasingly rely, repeated model calls

Comments: 28 pages, 18 figures

Abstract:Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.

10. 【2605.18812】PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

Link: https://arxiv.org/abs/2605.18812

Authors: Varun Kotte

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Modern NLP, named entity recognition, retrieval-augmented generation, named entity, entity recognition

Abstract:Modern NLP and LLM systems are pipelines: named entity recognition (NER) - entity disambiguation (NED) - entity typing, retrieval-augmented generation (retriever - reader), and agentic chains of planner - tool - critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER - NED - entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
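
The reduction described in the abstract above (one split-conformal quantile on the per-example maximum nonconformity score across stages) can be sketched roughly as follows. This is a minimal illustration using the standard split-conformal quantile rule. The function names and toy scores are hypothetical, and the paper's actual scoring functions are not reproduced here.

```python
import math


def joint_threshold(cal_scores, alpha):
    """Conformal threshold on the joint maximum nonconformity score.

    cal_scores: one list per calibration example, holding that example's
    true-label nonconformity score at each of the K stages.
    """
    joint = sorted(max(stage_scores) for stage_scores in cal_scores)
    n = len(joint)
    # Standard split-conformal rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: sets must be trivial.
        return math.inf
    return joint[rank - 1]


def prediction_sets(candidate_scores, threshold):
    """Keep, at every stage, each label whose score is within the threshold.

    candidate_scores: one dict per stage mapping label -> nonconformity score.
    """
    return [
        {label for label, score in stage.items() if score <= threshold}
        for stage in candidate_scores
    ]


# Toy calibration data for a two-stage pipeline (hypothetical numbers).
calibration = [[0.1, 0.2], [0.4, 0.3], [0.6, 0.5], [0.2, 0.9]]
q_hat = joint_threshold(calibration, alpha=0.5)

# Toy test example: candidate labels and their scores at each stage.
test_example = [{"PER": 0.1, "ORG": 0.7}, {"Q1": 0.6, "Q2": 0.61}]
sets = prediction_sets(test_example, q_hat)
print(q_hat, sets)
```

Because a single threshold is applied to every stage, all stages are covered simultaneously whenever the test example's joint maximum score falls below it, which is the event the scalar quantile controls.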

11. 【2605.18806】Towards FairRAG: Preventing Representational Harm in Retrieval-Augmented Generation by Enforcing Fair Exposure at Retrieval Time

Link: https://arxiv.org/abs/2605.18806

Authors: Riddhi Tikoo

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Model, Large Language, Language Model, Representative Stochastic, model hallucination

Abstract:As Large Language Model (LLM) integration has accelerated in high-stakes domains, model hallucination is a critical issue. Retrieval-augmented generation (RAG) is a technique for addressing hallucination; however, RAG's multi-component pipeline introduces vulnerabilities where biases can be introduced. This study considers two previously developed utility-focused ranking strategies (Standard and Stochastic) alongside two proposed exposure-aware approaches (Forced-Exposure and Representative Stochastic). Using the TREC 2022 Fair Ranking Dataset, which contains Wikipedia articles annotated as protected or non-protected, the LLM was asked to identify relevant articles with citations for four scenario-based QA prompts. The retrieval rankings and the generated outputs were evaluated for exposure bias and utility across all ranking methods. Overall, the Representative Stochastic ranker resulted in a statistically significant near-parity average exposure, acknowledging that relevance scores initially produced during retrieval are already shaped by representational bias, whereas the other rankers assume those scores are unbiased. Across all the methods of document ranking, generation demographic parity closely mirrored the exposure parity, reinforcing that representational bias in RAG systems is driven by retrieval and propagates to generation. These findings highlight that retrieval ranking is a critical point for mitigating downstream bias and propose a Representative Stochastic ranker that reintroduces fairness in RAG systems.

12. 【2605.18805】RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

Link: https://arxiv.org/abs/2605.18805

Authors: Imad Aouali, Flavian Vasile, Otmane Sakhi, Alexandre Gilotte, Benjamin Heymann

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: increasingly produce structured, LLM recommendation agents, produce structured recommendation, agents increasingly produce, LLM recommendation

Comments: Benchmark on LLM Recommendation Agents

Abstract:LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to either semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, degrades under noisy or misaligned signals, and reveals that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.

13. 【2605.18801】Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

Link: https://arxiv.org/abs/2605.18801

Authors: Shiqiang Wang, Herbert Woisetschläger, Hans Arno Jacobsen, Mingyue Ji

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Data, LLM, LLM workflow, large language models, large language

Comments: Accepted to ICML 2026 Position Paper Track

Abstract:Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

14. 【2605.18792】Trust or Abstain? A Self-Aware RAG Approach

Link: https://arxiv.org/abs/2605.18792

Authors: Xi Zhu, Ziqi Wang, Kai Mei, Wujiang Xu, Minghao Guo, Bangji Yang, Jiajun Fan, Dimitris N. Metaxas

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: large language models, incorporating external evidence, Retrieval-augmented generation, retrieved contextual knowledge, improves large language

Abstract:Retrieval-augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self-awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model-specific, ground-truth-aligned knowledge-conflict benchmark by evaluating LLM backbones on PK-only and CK-conditioned answer paths over approximately 69K query-context instances per backbone, drawn from five conflict-QA datasets. We then introduce SABER, a Self-Aware Belief Estimator for RAG that requires no LLM fine-tuning. SABER combines a self-prior with PK-side and CK-side conditional reasoning representations from multi-trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4-cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end-to-end accuracy and conflict-specific faithfulness over ten inference-time and fine-tuning baselines, with the largest gains on conflict-heavy datasets. Under abstention, SABER's risk-coverage curve Pareto-dominates every prompt-based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at this https URL.
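
The 4-cell decision mentioned in the abstract above can be pictured with a small sketch. It assumes that each of the two predictors outputs a reliability belief in [0, 1] and that a single cutoff separates reliable from unreliable answer paths. The thresholding rule, the cutoff value, and all names here are illustrative assumptions, not SABER's actual decision procedure.

```python
from enum import Enum


class Decision(Enum):
    TRUST_PK = "trust PK"
    TRUST_CK = "trust CK"
    TRUST_EITHER = "trust either"
    ABSTAIN = "abstain"


def four_cell_decision(belief_pk: float, belief_ck: float,
                       threshold: float = 0.5) -> Decision:
    """Map two reliability beliefs onto the four cells.

    belief_pk: estimated reliability of the parametric-knowledge answer path.
    belief_ck: estimated reliability of the contextual-knowledge answer path.
    threshold: hypothetical cutoff; raising it trades coverage for lower risk.
    """
    pk_reliable = belief_pk >= threshold
    ck_reliable = belief_ck >= threshold
    if pk_reliable and ck_reliable:
        return Decision.TRUST_EITHER
    if pk_reliable:
        return Decision.TRUST_PK
    if ck_reliable:
        return Decision.TRUST_CK
    return Decision.ABSTAIN


for beliefs in [(0.9, 0.8), (0.9, 0.2), (0.1, 0.7), (0.3, 0.4)]:
    print(beliefs, four_cell_decision(*beliefs).value)
```

Sweeping the cutoff from 0 to 1 is one simple way to trace a risk-coverage curve of the kind the abstract refers to, since a higher cutoff sends more queries to the abstain cell.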

15. 【2605.18780】A Reproducibility Analysis of PO4ISR: Diagnosing and Mitigating Semantic Drift in LLM-Based Session Recommendation

Link: https://arxiv.org/abs/2605.18780

Authors: Aditya Tiwari, Konduri Naga Lakshmi Rekha, Rajesh Kumar Mundotiya

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Reasoning-based Large Language, Large Language Models, Reasoning-based Large, Large Language, Language Models

Abstract:Reasoning-based Large Language Models (LLMs) like PO4ISR have set new benchmarks in session-based recommendation. However, the reproducibility of their reasoning capabilities across diverse semantic domains remains unexplored. In this work, we conduct a rigorous reproducibility study of PO4ISR to assess its generalization limits. Our analysis reveals a critical failure mode: standard reasoning prompts suffer from severe contextual drift in long sessions, leading to performance degradation on semantically complex datasets like Games and Bundle. To quantify and resolve this stability gap, we introduce PO4ISR++, a robustness-enhanced implementation that integrates reflexive prompting and consistent rank detection. Unlike the original static prompting strategy, our approach dynamically adapts to cross-domain cues. We benchmark both the original implementation and our robust variant on ML-1M, Games, and Bundle. Our results confirm that while the original model struggles in new domains, our reproducible extension restores performance, yielding a stabilized gain of up to 54% on Games and 96% on Bundle. We release open-source artifacts, including the reproduced baseline and our enhanced framework, to facilitate reliable future research in LLM-based recommendation.

16. 【2605.18776】Mask-to-Correct$^+$: Leveraging Retriever Diversity for Masking-guided Faithful Fact Correction

Link: https://arxiv.org/abs/2605.18776

Authors: Payel Santra, Lavisha Sharma, Madhusudan Ghosh, Partha Basuchowdhuri

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: social media highlights, automated fact correction, automated fact, rapid spread, spread of misinformation

Abstract:The rapid spread of misinformation on social media highlights the need for robust, automated fact correction frameworks. However, existing works rely on supervised learning from manually annotated claim-evidence pairs, which are scarce and prone to biases, limiting their generalization across domains. Moreover, these methods overlook semantic faithfulness in their correction process. To address these challenges, we propose Mask-to-Correct (M$_2$C), a training-free, inference-only Retrieval Augmented Generation (RAG) based framework that leverages diversity-aware masking to identify erroneous spans of claims and evaluate the faithfulness of corrections using retrieved evidence. However, the effectiveness of RAG heavily depends on the choice of retriever, which may vary across queries. To mitigate this, we further introduce M$_2$C$^+$, an ensemble-based framework that combines corrections across multiple rankers to reduce retrieval bias and improve robustness. Extensive experiments on the benchmark datasets demonstrate that our proposed frameworks consistently outperform all baselines, achieving up to 14% improvement in SARI scores, without using gold evidence.

17. 【2605.18775】Query-Aware Flow Diffusion for Graph-Based RAG with Retrieval Guarantees

Link: https://arxiv.org/abs/2605.18775

Authors: Zhuoping Zhou, Davoud Ataee Tarzanagh, Sima Didari, Wenjun Hu, Baruch Gutow, Oxana Verkholyak, Masoud Faraki, Heng Hao, Hankyu Moon, Seungjai Min

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Graph-based Retrieval-Augmented Generation, systems leverage interconnected, leverage interconnected knowledge, interconnected knowledge structures, capture complex relationships

Comments: Published at the International Conference on Learning Representations (ICLR) 2026. 38 pages, 5 figures, 10 tables

Abstract:Graph-based Retrieval-Augmented Generation (RAG) systems leverage interconnected knowledge structures to capture complex relationships that flat retrieval struggles with, enabling multi-hop reasoning. Yet most existing graph-based methods suffer from (i) heuristic designs lacking theoretical guarantees for subgraph quality or relevance and/or (ii) the use of static exploration strategies that ignore the query's holistic meaning, retrieving neighborhoods or communities regardless of intent. We propose Query-Aware Flow Diffusion RAG (QAFD-RAG), a training-free framework that dynamically adapts graph traversal to each query's holistic semantics. The central innovation is query-aware traversal: during graph exploration, edges are dynamically weighted by how well their endpoints align with the query's embedding, guiding flow along semantically relevant paths while avoiding structurally connected but irrelevant regions. These query-specific reasoning subgraphs enable the first statistical guarantees for query-aware graph retrieval, showing that QAFD-RAG recovers relevant subgraphs with high probability under mild signal-to-noise conditions. The algorithm converges exponentially fast, with complexity scaling with the retrieved subgraph size rather than the full graph. Experiments on question answering and text-to-SQL tasks demonstrate consistent improvements over state-of-the-art graph-based RAG methods.

18. 【2605.18774】M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

Link: https://arxiv.org/abs/2605.18774

Authors: Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: document true structure, chunk boundaries follow, retrieval-augmented generation, depends heavily, true structure

Comments: Accepted to CVPR 2026 Main

Abstract:In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.

19. 【2605.18772】Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization

Link: https://arxiv.org/abs/2605.18772

Authors: Gongbo Zhang, Yifan Peng, Chunhua Weng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language model, Retrieval-Augmented Generation, grounding generation, external knowledge, factual accuracy

Comments

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.

20. 【2605.18771】LWGR: Lagrangian-Constrained Personalized World Knowledge for Generative Recommendation

Link: https://arxiv.org/abs/2605.18771

Authors: Lingyu Mu, Hao Deng, Haibo Xing, Kaican Lin, Zhitong Zhu, Yu Zhang, Xiaoyi Zeng, Zhengxiao Liu, Zheng Lin, Jinxin Hu

Categories: Information Retrieval (cs.IR)

Keywords: large language model, Recent progress, substantially improve performance, based generative recommendation, language model

Comments

Click to view abstract

Abstract:Recent progress in large language model (LLM) based generative recommendation (GR) shows that leveraging LLM world knowledge can substantially improve performance. However, existing methods rely on fixed, manually designed instructions to generate semantic knowledge and directly incorporate it into GR, which has two limitations. First, fixed instructions cannot capture the multidimensional heterogeneity of user interests. Second, uncontrollable knowledge fusion may conflict with behavioral signals and harm recommendations. To address these limitations, we propose LWGR, a framework that leverages Lagrangian constraints to transfer users' personalized world knowledge from LLMs into generative recommendation. LWGR enhances GR along two axes: knowledge extraction and fusion. It builds personalized soft instructions to extract behavior-relevant LLM world knowledge, and formulates knowledge fusion as an optimization problem with explicitly bounded performance degradation, which is solved by a Lagrangian primal-dual method to selectively incorporate beneficial knowledge. We further design two training strategies for different LLM scales and a deployment scheme that combines nearline precomputation with lightweight online serving. Experiments on multiple public datasets and one industrial dataset show that LWGR outperforms eight state-of-the-art baselines by up to 11.23% and brings a 1.35% revenue lift on a large-scale advertising platform, demonstrating its effectiveness and practicality.
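
The abstract mentions solving a constrained problem with a Lagrangian primal-dual method. As a rough illustration of that general pattern only, the toy problem below minimizes (x - 3)^2 subject to x <= 1, where the constraint stands in for a bounded-degradation budget. The objective, the constraint, and the name `primal_dual` are all assumptions made for this sketch and are not taken from the paper.

```python
def primal_dual(steps=60, lr_dual=1.0):
    """Dual ascent on L(x, lam) = (x - 3)^2 + lam * (x - 1), with lam >= 0."""
    lam = 0.0
    x = 3.0
    for _ in range(steps):
        # Primal step: the minimizer of L over x has the closed form 3 - lam / 2.
        x = 3.0 - lam / 2.0
        # Dual step: raise lam while the constraint x <= 1 is violated,
        # then project back onto lam >= 0.
        lam = max(0.0, lam + lr_dual * (x - 1.0))
    return x, lam


x_opt, lam_opt = primal_dual()
```

The multiplier grows only while the constraint is violated, so the iterate settles on the constraint boundary (x near 1, multiplier near 4) instead of the unconstrained optimum at x = 3.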

21. 【2605.18770】Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI

Link: https://arxiv.org/abs/2605.18770

Authors: Arthur Capozzi, Dirk Helbing

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: present a collaborative, expert analysis, collaborative agentic GraphRAG, commercial registry data, collaborative agentic

Comments

Click to view abstract

Abstract:We present a collaborative agentic GraphRAG framework for expert analysis of commercial registry data. Public registries are often formally accessible, yet difficult to use in practice because they combine structured records with large volumes of unstructured legal text. This limits conventional keyword and vector-only retrieval, especially for multi-hop, temporal, and entity-centric investigations. Our approach builds a Neo4j knowledge graph through a three-phase pipeline: (i) deterministic ingestion of strong nodes from verified structured fields, (ii) LLM-based extraction of weak nodes from unstructured notices, and (iii) deterministic identity resolution and deduplication. On top of this graph, we introduce an analytical modular agent that integrates zero-shot intent routing, a bounded reflection loop, secure tool-mediated graph access, and state-aware response synthesis. A human-in-the-loop dashboard exposes evidence and execution traces to support transparency and auditability. We evaluate the framework on the Swiss Official Gazette of Commerce, a multilingual corpus of more than seven million publications over seven years. We further contribute a multi-tier evaluation protocol covering entity-resolution precision, tool-routing behavior, answer quality, and multi-turn conversational performance. Across automated, human-curated, and conversational benchmarks, the proposed agentic GraphRAG system consistently outperforms a standard agentic vector-RAG baseline, with strong gains in correctness, answer relevance, information recall, turn success rate, and context carryover accuracy. The architecture is modular, reproducible, and transferable to other commercial gazettes and public-sector registry systems.

22. 【2605.18769】ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.18769

Authors: Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Personalized Retrieval-Augmented Generation, accurately selecting user-relevant, selecting user-relevant documents, enhance personalized generation, Personalized Retrieval-Augmented

Comments: 17 pages, 2 figures, to be published in the proceedings of ACL 2026

Click to view abstract

Abstract:Personalized Retrieval-Augmented Generation (RAG) relies on accurately selecting user-relevant documents. In practice, existing RAG approaches often suffer from high retrieval costs and overlook that collaborative signals from similar users can enhance personalized generation for the current user. We propose ClusterRAG, a cluster-based collaborative filtering approach for personalized Retrieval-Augmented Generation. ClusterRAG represents users through their profile documents, organizes users into semantically coherent clusters using density-based clustering, and performs retrieval at both the cluster and document levels via cluster-level similarity and fine-grained ranking. Extensive experiments on the LaMP benchmark demonstrate that jointly leveraging the target user's profile and profiles from top similar users consistently yields the best performance across diverse tasks. Further analysis shows that ClusterRAG integrates seamlessly with different dense retrievers and rankers, and remains effective when paired with both fine-tuned and zero-shot language models.

23. 【2605.18768】ClinQueryAgent: A Conversational Agent for Population Health Management

Link: https://arxiv.org/abs/2605.18768

Authors: Joseph S. Boyle, Anthony Dranfield, Mike O'Neil, Maria Liakata, Alison Q. Smithard

Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Keywords: external knowledge bases, executable database queries, introduce ClinQueryAgent, knowledge bases, translating natural language

Comments: 11 pages, 4 figures. Submitted to ACL Systems Demonstrations

Click to view abstract

Abstract:In this paper we introduce ClinQueryAgent, a system for translating natural language population health questions into executable database queries using agents with access to both local and external knowledge bases. Our novel architecture enables the use of powerful cloud-based language models whilst ensuring that no patient data leaves the secure environment. To combat inaccuracies over the course of longer dialogues due to context rot, information retrieval is delegated to a sub-agent. We deploy the system via a chat window embedded within an existing population health management platform where it has been used by 128 staff from 15 healthcare practices covering a total of 148,319 patients in the UK's National Health Service (NHS). We evaluate the system's capacity to autonomously handle a range of health informatics tasks on a constructed dataset and via a beta-testing phase. Our results show that both analysts and clinicians are able to easily generate actionable information from patient health records using natural language requests requiring no programming expertise to verify. We make a public demo of the system available at: this https URL

24. 【2605.18767】DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking

Link: https://arxiv.org/abs/2605.18767

Authors: Litong Zhang, Jiaxin Li, Kuo Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: question answering requires, answering requires aggregating, Multi-hop question answering, requires aggregating information, knowledge-intensive applications

Comments

Click to view abstract

Abstract:Multi-hop question answering requires aggregating information from multiple documents, a critical capability for knowledge-intensive applications. A fundamental challenge lies in efficiently identifying the minimal relevant document set from retrieved candidates while maintaining high recall. We present an efficient dual-view cascaded reranking framework for multi-hop document reranking. Operating as a lightweight post-retrieval stage over E5-base-v2 candidates, our architecture comprises: (1) a Local Scorer employing stacked cross-attention for fine-grained query-document relevance; and (2) a Global Scorer modeling inter-document dependencies via Transformer-based context aggregation. These views are dynamically fused through an Adaptive Gate conditioned on query semantics. Under the fixed candidate set reranking setting with offline cached embeddings, our model achieves competitive results, particularly outstanding on MuSiQue with 99.4% Top-4 Recall and 97.8% Full Hit accuracy at 4.0 ms latency (249 QPS). It substantially outperforms 600M-parameter cross-encoders (BGE-Large: 92.0% Recall, Jina-v3: 90.1% Recall) while maintaining 5 to 6 times lower latency. Ablation studies validate that both Local and Global views contribute substantially to multi-hop performance.
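
The gated fusion of the two views described above can be sketched as a convex combination of the local and global scores, with the mixing weight computed from the query. The abstract does not specify the gate's form, so the sigmoid over a linear projection, and the names `gated_fusion`, `gate_w`, and `gate_b`, are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_scores, global_scores, query_vec, gate_w, gate_b):
    """Fuse per-document scores with one query-conditioned gate g in (0, 1).

    Assumption: g = sigmoid(gate_w . query_vec + gate_b), shared by all documents.
    """
    g = sigmoid(sum(w * q for w, q in zip(gate_w, query_vec)) + gate_b)
    fused = [g * l + (1.0 - g) * s for l, s in zip(local_scores, global_scores)]
    return g, fused


# With a zero query vector and zero bias the gate sits at 0.5,
# so both views contribute equally to each document's score.
g, fused = gated_fusion(
    local_scores=[0.9, 0.2],
    global_scores=[0.1, 0.8],
    query_vec=[0.0, 0.0],
    gate_w=[1.0, -1.0],
    gate_b=0.0,
)
```

A larger gate value would favor the fine-grained query-document score, and a smaller one would favor the inter-document context score.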

25. 【2605.18766】Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

Link: https://arxiv.org/abs/2605.18766

Authors: Taehee Kim, Seungbin Yang, Jihwan Kim, Jaegul Choo

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieving relevant tables, accurately answering questions, Retrieving relevant, natural language query, natural language

Comments: ACL 2026 Findings

Click to view abstract

Abstract:Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at this https URL.
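
The contrast between fixed top-k retrieval and an adaptive cutoff can be shown with a minimal sketch. The abstract does not say how the threshold is computed, so the relative-to-best rule, the `ratio` parameter, and the table names and scores below are assumptions made for illustration. The sliding-window reranking step is not modeled.

```python
def top_k(scores, k):
    """Baseline: always return the k highest-scoring tables."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def adaptive_retrieve(scores, ratio=0.8):
    """Keep every table scoring at least `ratio` times the best score.

    Assumes non-negative similarity scores; `ratio` is a hypothetical knob.
    """
    if not scores:
        return []
    cutoff = ratio * max(scores.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= cutoff]


# Made-up similarity scores for two queries over a small schema.
single_table_query = {"singer": 0.92, "concert": 0.41, "stadium": 0.38}
join_query = {"singer": 0.88, "concert": 0.85, "singer_in_concert": 0.81, "stadium": 0.30}

fixed_single = top_k(single_table_query, k=2)    # oversized: pulls in "concert"
fixed_join = top_k(join_query, k=2)              # undersized: misses the join table
adaptive_single = adaptive_retrieve(single_table_query)
adaptive_join = adaptive_retrieve(join_query)
```

With these scores the adaptive rule returns one table for the first query and three for the second, whereas k = 2 is wrong for both.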

26. 【2605.18765】STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation

Link: https://arxiv.org/abs/2605.18765

Authors: Shuai Li, Chen Huang, Duanyu Feng, Wenqiang Lei, See-Kiong Ng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: augment Large Language, Retrieval Augmented Generation, Large Language Models, Augmented Generation, Graph Retrieval Augmented

Comments

Click to view abstract

Abstract:To augment Large Language Models (LLMs) for multi-hop question answering, a mainstream solution within Graph Retrieval Augmented Generation (GraphRAG) leverages lightweight retrievers to efficiently extract information from a given Knowledge Graph (KG). However, existing methods often overlook the inherent challenge of sparse semantic information in graphs. Specifically, our experiments reveal that these methods produce biased retrieval, namely Semantic Shortcut Bias and Long-Tail Path Bias, leading to inadequate semantic modeling and limited GraphRAG effectiveness. To address these issues, we propose STAR, a semantic-tuned and tail-adaptive retriever for GraphRAG. STAR integrates two key learning paradigms: token-level interaction learning and path-weighted contrastive learning. The former employs a cross-attention architecture and a hard path mining mechanism to jointly model the query and path, thereby mitigating the Semantic Shortcut Bias. The latter introduces a tailored contrastive learning objective that utilizes tail-adaptive path weighting, designed to optimize the training process and ease the Long-Tail Path Bias. Extensive experiments demonstrate that STAR consistently outperforms baselines, achieving average retrieval performance gains of 1.8% and LLM QA performance improvements of 2.2% across all benchmark datasets. Our code is available at this https URL.

27. 【2605.18764】From Intent to AI Pipelines: A Controlled Agentic Framework for Non-AI Expert Scientists

Link: https://arxiv.org/abs/2605.18764

Authors: Hyacinth Ali, Jessie Galasso-Carbonnel, Houari Sahraoui

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Artificial Intelligence, large-scale data analysis, enabling large-scale data, Medical Sciences, Social Sciences

Comments

Click to view abstract

Abstract:Artificial Intelligence (AI) pipelines have become integral to modern research, supporting fields such as Medical Sciences, Agriculture, and Social Sciences, and enabling large-scale data analysis, predictive modeling, and the automation of complex tasks. However, designing and implementing AI solutions remains challenging for many researchers due to the expertise required in the design and development of end-to-end AI systems. To address this gap, we present Domain-Driven Adaptable AI Pipelines (DDAP), a controlled, human-in-the-loop, agentic framework that leverages large language models to guide users in a systematic construction of AI pipelines and their corresponding implementation code. DDAP structures the development process into four stages: problem definition, compute environment specification, pipeline generation, and code generation. Through this staged interaction, the framework adapts to domain context, user expertise, and resource constraints, while maintaining user control over key decisions. We evaluate DDAP across multiple datasets spanning business, biology, and health science domains by comparing its AI models against expert-developed models. The experimental results show that DDAP achieves competitive results in several tasks compared to expert baselines, although performance varies across problem types, particularly for text-based clustering tasks. By combining guided interaction, adaptability, and reproducibility, DDAP demonstrates that a controlled agentic framework can generate competitive AI pipelines for non-expert users.

28. 【2605.18763】Query-Conditioned Graph Retrieval for Contextualized LLM Reasoning in Personalized Wearable Data

Link: https://arxiv.org/abs/2605.18763

Authors: Zhenyu Lu, Mahyar Abbasian, Amir M. Rahmani

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, analyzing wearable sensing, language models, increasingly applied

Comments

Click to view abstract

Abstract:Large language models (LLMs) are increasingly applied to analyzing wearable sensing data, which are long-term, multimodal, and highly personalized. A key challenge is context selection: providing insufficient context limits reasoning, while including all available data leads to inefficiency and degraded generation quality. We propose Wearable As Graph (WAG), a graph-based context retrieval framework that enables query-adaptive reasoning over wearable data with LLMs. WAG organizes wearable metrics and user-specific signals into a personalized knowledge graph, and retrieves a query-conditioned subgraph to support downstream generation. The retrieval process integrates global relationships, capturing prior knowledge and population- and individual-level patterns via hierarchical Bayesian modeling, with local relationships that reflect short-term signal deviations. A query openness signal further controls retrieval breadth. We evaluate WAG on over 10,000 data-grounded queries from real-world wearable datasets. Across LLM-based and human evaluations, WAG achieves an approximately 70% win rate over baseline and standard RAG methods, demonstrating the effectiveness of structured, query-adaptive context retrieval for LLM-driven analysis of wearable data.

29. 【2605.18762】ALDEN: Boosting Private Data Extraction from Retrieval-Augmented Generation Systems via Active Learning and Distribution Estimation

Link: https://arxiv.org/abs/2605.18762

Authors: Xingyu Lyu, Jianfeng He, Ning Wang, Yidan Hu, Tao Li, Danjue Chen, Shixiong Li, Yimin Chen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: augment large language, large language models, external knowledge retrieval, reliability and generalization, augment large

Comments

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) is widely used to augment large language models with external knowledge retrieval to improve reliability and generalization. However, recent studies have shown that RAG systems remain vulnerable to data extraction attacks, where adversaries can extract private data by embedding malicious commands into user queries. Despite their feasibility, existing attacks typically suffer from low data extraction rates and limited practical effectiveness. Here, we propose ALDEN, a novel attack that effectively and efficiently extracts private data from RAGs. First, we employ active learning to diversify malicious queries and improve data extraction rates. Second, we observe that the data distribution of the underlying knowledge base provides valuable guidance for query generation and introduce a decay-based dynamic algorithm to estimate the corresponding topic distribution. By combining them together, we demonstrate that ALDEN substantially outperforms state-of-the-art methods through comprehensive evaluations.

30. 【2605.18760】DOTRAG: Retrieval-Time Reasoning Along Paths

Link: https://arxiv.org/abs/2605.18760

Authors: Larnell Moore, Naihao Deng, Rada Mihalcea, Farnaz Jahanbakhsh

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Graph Retrieval-Augmented Generation, Retrieval-Augmented Generation, retrieved using heuristics, Generation, Graph Retrieval-Augmented

Comments

Click to view abstract

Abstract:Graph Retrieval-Augmented Generation (GraphRAG) is dominated by a retrieve-then-reason paradigm, where context is retrieved using heuristics and then reasoned over. Such methods struggle to adapt to the query-specific logic required for complex multi-hop tasks, often accumulating irrelevant context or missing correct relational paths. We propose DotRAG, a training-free GraphRAG framework that reformulates retrieval as a reasoning process over paths. Our approach generates query-conditioned constraints that guide graph exploration, prune irrelevant regions, and iteratively discover relational paths without relying on explicit step-by-step reasoning chains. We introduce Division of Thought (DOT), an abstraction that decomposes retrieval into localized search spaces and adapts the search strategy to each query. DotRAG achieves SOTA performance on MetaQA and UltraDomain, with consistent gains on multi-hop tasks, demonstrating the effectiveness of reasoning-guided retrieval.

31. 【2605.17809】Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

Link: https://arxiv.org/abs/2605.17809

Authors: Chun-Hsiung Tseng, Hao-Chiang Koong Lin, Andrew Chih-Wei Huang, Yung-Hui Chen, Jia-Rou Lin

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: developing Artificial Intelligence, leveraging Large Language, Large Language Models, Artificial Intelligence, Large Language

Comments

Click to view abstract

Abstract:This research addresses the challenges inherent in developing Artificial Intelligence (AI) applications, particularly those leveraging Large Language Models (LLMs). While AI vendors provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to facilitate developer interaction, the former often requires intricate manual request construction, and the latter can lead to significant vendor lock-in. Furthermore, existing model abstraction frameworks, though mitigating vendor dependency, introduce an additional layer of complexity and potential security concerns. To reconcile these conflicting factors, the study introduces PuppyChatter, a novel software framework designed to preserve the intuitive simplicity of vendor-specific SDKs while simultaneously adhering to the vendor-neutrality principles characteristic of model abstraction, thereby offering a more streamlined and flexible development paradigm.

Computer Vision

1. 【2605.20185】PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars

Link: https://arxiv.org/abs/2605.20185

Authors: Julian Kaltheuner, Jan Spindler, Sina Kitz, Patrick Stotko, Reinhard Klein

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing Gaussian avatar, Gaussian avatar methods, methods typically parameterize, Existing Gaussian, avatar methods typically

Comments

Click to view abstract

Abstract:Existing Gaussian avatar methods typically parameterize geometry on a body-template surface, which entangles the avatar's representation space with the template's deformation space and limits the capture of layered, off-body, and non-rigid clothing geometry. We present PiG-Avatar, which addresses this limitation by using the parametric body model solely for kinematic transport, while representing the avatar as Gaussians anchored in a volumetric canonical space governed by a continuous neural field. This decouples representation from template topology, avoiding the geometric constraints of surface-based parameterizations. Kinematic coherence is maintained through 3D barycentric anchor transport, which guides motion without constraining geometry and allows anchors to deviate freely from the template surface, yielding dense, stable temporal surface correspondences by construction. To make this unconstrained formulation tractable, we introduce dual-level spatially coherent optimization, combining Sobolev-preconditioned neural-field updates with a novel KNN-based preconditioning of canonical anchor geometry. Together, these mechanisms induce an emergent self-organization of anchor density: anchors migrate toward regions of high curvature, appearance variation, and non-coherent motion without explicit heuristics. As a result, complex clothing geometry and layered surfaces emerge as natural, high-fidelity outputs. This single representation further supports hierarchical reconstruction across multiple levels of detail, with coarse-level supervision propagating to finer levels through the shared field and coupled anchor graph. On established benchmarks featuring subjects with complex clothing and challenging non-rigid motion, PiG-Avatar achieves state-of-the-art rendering quality, generalizes robustly to imperfect body model initialization, and renders in real time across all detail levels.

2. 【2605.20183】MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Link: https://arxiv.org/abs/2605.20183

Authors: Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: meet real-world demands, narratives to meet, real-world demands, rapidly evolving, evolving from single-shot

Comments

Click to view abstract

Abstract:Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.

3. 【2605.20177】From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Link: https://arxiv.org/abs/2605.20177

Authors: Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, emphasize long, visual perception, reasoning, advances in vision-language

Comments: 19 pages, 9 figures; Accepted to ICML 2026; Project Page: [this https URL](https://ucsc-vlaa.github.io/VLM-CapCurriculum/)

Click to view abstract

Abstract:Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) compared with the base counterpart.

4. 【2605.20174】Multi-axis Analysis of Image Manipulation Localization

Link: https://arxiv.org/abs/2605.20174

Authors: Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: editing software enables, software enables easy, enables easy creation, image editing software, highly convincing image

Comments: 28 pages, 5 figures, 5 tables

Click to view abstract

Abstract:Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

5. 【2605.20165】CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Link: https://arxiv.org/abs/2605.20165

Authors: Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang, Jenq-Neng Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gains reflect genuine, genuine spatial intelligence, reflect genuine spatial, spatial, question answering benchmarks

Comments: Code and model available at [this https URL](https://github.com/hsiangwei0903/CaMo)

Click to view abstract

Abstract:Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at this https URL

6. 【2605.20159】Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

Link: https://arxiv.org/abs/2605.20159

Authors: Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland (SES), Thomas Philippe

Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Keywords: X-ray computed tomography, expert visual assessment, current workflows offering, workflows offering limited, offering limited traceability

Comments: none

Abstract:Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix-air interfaces, pores, line-like defects, and mixed morphologies), so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is, and is not, reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

7. 【2605.20158】Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Link: https://arxiv.org/abs/2605.20158

Authors: Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, faithfully ground responses

Comments: none

Abstract:Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at this https URL.

8. 【2605.20150】TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

Link: https://arxiv.org/abs/2605.20150

Authors: Chonghao Zhong, Linfeng Shi, Hua Chen, Tiecheng Sun, Hao Zhao, Binhang Yuan, Chaojian Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)

Keywords: large attribute vector, exceeds GPU capacity, table quickly exceeds, quickly exceeds GPU, Gaussian primitive carries

Comments: Accepted to ICML 2026 as Spotlight. Website: [this https URL](https://sponge-lab.github.io/TideGS)

Abstract:Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).
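
The idea of transferring only working-set deltas between iterations can be illustrated with plain set arithmetic. The following is a minimal sketch, assuming each iteration's working set is represented as a set of block IDs; the function name `working_set_delta` and the IDs are hypothetical and are not taken from the TideGS code.

```python
def working_set_delta(resident, needed):
    """Return (to_load, to_evict) so that the resident set becomes `needed`.

    Only the difference between consecutive working sets is moved,
    instead of re-uploading every visible block on each iteration.
    """
    to_load = needed - resident    # visible now, but not yet on the GPU
    to_evict = resident - needed   # on the GPU, but no longer visible
    return to_load, to_evict


# Hypothetical block IDs visible from two consecutive camera batches.
prev_visible = {1, 2, 3, 4}
curr_visible = {3, 4, 5}

to_load, to_evict = working_set_delta(prev_visible, curr_visible)
```

When consecutive camera batches overlap heavily, both returned sets stay small, which is the property a differential streaming scheme relies on.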

9. 【2605.20147】PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Link: https://arxiv.org/abs/2605.20147

Authors: Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently seen notable, notable progress, UHR image generation, UHR, image generation

Comments: Project page is available at [this https URL](https://haojunchen663.github.io/projects/PixVerve/)

Abstract:Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the growing desire for a better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies together provide valuable insights for future breakthroughs.

10. 【2605.20110】SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Link: https://arxiv.org/abs/2605.20110

Authors: Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Model, Large Vision Language, sets remains challenging, grounds natural-language queries, Previous Large Vision

Comments: none

Abstract:Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 JF on MeViS and +12.4 JF on Ref-SeCVOS.

11. 【2605.20090】MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Link: https://arxiv.org/abs/2605.20090

Authors: Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal remote sensing, scarce in practice, Multi-modal remote, remote sensing, complete paired observations

Comments: none

Abstract:Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at this https URL.

12. 【2605.20085】Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Link: https://arxiv.org/abs/2605.20085

Authors: Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spatially Prompted, Robotic manipulation, Spatially Prompted Visual, language instructions, cluttered environments

Comments: none

Abstract:Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting uses initial spatial prompts (such as bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

13. 【2605.20082】VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Link: https://arxiv.org/abs/2605.20082

Authors: Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous driving datasets, powerful motion forecasting, rapid growth, growth of autonomous, enabled the scaling

Comments: Published in International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 6 figures, 4 tables

Abstract:The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
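
For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO objective can be written in a few lines. This is a minimal sketch of the generic DPO loss, assuming the inputs are log-probabilities of a preferred and a rejected trajectory under the finetuned policy and under a frozen reference model; the function name, the `beta` value, and the example numbers are illustrative and are not taken from the VL-DPO paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already prefers the chosen rollout.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Same numbers with the preference reversed.
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss shrinks as the policy assigns relatively more probability to the preferred rollout than the reference model does, which is how preference pairs steer finetuning without an explicit reward model.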

14. 【2605.20079】Probability-Conserving Flow Guidance

Link: https://arxiv.org/abs/2605.20079

Authors: Parsa Esmati, Junha Hyung, Amirhossein Dadashzadeh, Jaegul Choo, Majid Mirmehdi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: dominate visual synthesis, models dominate visual, Diffusion and flow-based, flow-based generative models, generative models dominate

Comments: none

Abstract:Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating a time-dependent schedule alongside score-parallel attenuation. The resulting plug-and-play rule, Adaptive Manifold Guidance (AdaMaG), bounds both terms at no additional inference cost. Finally, we show that most empirical heuristics for reducing saturation or improving generation quality correspond directly to the two terms in our decomposition. Across image generation benchmarks, AdaMaG improves realism, reduces hallucinations, and induces controlled desaturation in high-guidance regimes.
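
The "heuristic linear combination" that the abstract attributes to Classifier-Free Guidance is easy to state concretely. The sketch below shows only the standard CFG rule on plain Python lists, not AdaMaG itself; the function name and the toy velocity values are hypothetical.

```python
def cfg_velocity(v_uncond, v_cond, scale):
    """Standard CFG: v = v_uncond + scale * (v_cond - v_uncond), element-wise.

    scale = 1 recovers the conditional prediction; scale > 1 extrapolates
    beyond it, which is the regime the abstract describes as pushing
    samples off the learned manifold.
    """
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


# Toy two-dimensional velocities (hypothetical values).
v_uncond = [0.0, 1.0]
v_cond = [1.0, 1.0]

v_plain = cfg_velocity(v_uncond, v_cond, 1.0)
v_strong = cfg_velocity(v_uncond, v_cond, 3.0)
```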

15. 【2605.20073】X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing

Link: https://arxiv.org/abs/2605.20073

Authors: E O Rodrigues, L O Rodrigues, J J Lima, D Casanova, F Favarim, E R Dosciatti, V Pegorini, L S N Oliveira, F F C Morais

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: x-ray angiograms, work proposes, segmentation in x-ray, proposes a pixel-classification, vessel segmentation

Comments: none

Abstract:This work proposes a pixel-classification approach for vessel segmentation in x-ray angiograms. The proposal uses textural features such as anisotropic diffusion, features based on the Hessian matrix, mathematical morphology and statistics. These features are extracted from the neighborhood of each pixel. The approach also uses the ELEMENT methodology, which consists of creating a pixel classification controlled by region growing, where the result of each classification affects further classifications of pixels. The Random Forests classifier is used to predict whether the pixel belongs to the vessel structure. The approach achieved the best accuracy in the literature (95.48%), outperforming unsupervised state-of-the-art approaches.
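
Classification controlled by region growing can be sketched generically: only neighbors of already accepted pixels are ever classified, so each decision shapes which pixels are examined next. The code below is a hedged illustration of that general pattern, not the ELEMENT methodology itself; the `is_vessel` predicate is a hypothetical stand-in for the Random Forests classifier, and the toy image is invented.

```python
from collections import deque


def region_grow(image, seed, is_vessel):
    """Grow a region from `seed`, classifying only 4-connected neighbors
    of pixels that were already accepted."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and is_vessel(image, nr, nc):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region


# Toy image; a simple intensity threshold stands in for the learned classifier.
image = [
    [0, 9, 0],
    [0, 9, 0],
    [0, 9, 9],
]
vessel = region_grow(image, (0, 1), lambda img, r, c: img[r][c] > 5)
```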

16. 【2605.20064】Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

Link: https://arxiv.org/abs/2605.20064

Authors: Guilherme Santos da Silva, Dalcimar Casanova, Jefferson Tales Oliva, Erick Oliveira Rodrigues

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: coronary heart disease, increased adipose tissue, adipose tissue surrounding, heart disease, human heart

Comments: none

Abstract:In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a conditional generative adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, despite not being originally tailored for this purpose. The two types of fat deposits of interest in this study are epicardial and mediastinal fat, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and an F1-score of 98.73 for the segmentation of the epicardial fat, and an accuracy of 97.90% and an F1-score of 98.40 for the mediastinal fat. These findings reflect the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of F1-score and run time, enabling the images to be segmented in real time.

17. 【2605.20044】OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

Link: https://arxiv.org/abs/2605.20044

Authors: Guiyu Liu, Niklas Vaara, Janne Mustaniemi, Juho Kannala, Janne Heikkilä

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hindering downstream tasks, lack inherent object-level, Gaussian Splatting, open-vocabulary scene understanding, primitives lack inherent

Comments: Under review

Abstract:3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $\sigma^{*}$ for object-mask rendering. The original opacity $\sigma$ remains responsible for visual reconstruction, while $\sigma^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.
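
The dual-opacity idea can be illustrated with ordinary front-to-back alpha compositing along a single ray. This is a minimal sketch under stated assumptions: the image branch composites with $\sigma$, the mask branch composites an instance indicator with $\sigma^{*}$, and all names and numbers are hypothetical. How OP2GS actually combines the two branches is not specified in the abstract.

```python
def composite(values, alphas):
    """Front-to-back alpha compositing: sum_i value_i * alpha_i * T_i,
    where T_i is the transmittance accumulated before sample i."""
    out, transmittance = 0.0, 1.0
    for value, alpha in zip(values, alphas):
        out += value * alpha * transmittance
        transmittance *= 1.0 - alpha
    return out


# Three Gaussians along one ray, ordered front to back (hypothetical values).
gray = [0.2, 0.9, 0.5]
sigma = [0.5, 0.8, 1.0]        # visual opacity, used for image rendering
sigma_star = [0.0, 0.8, 1.0]   # instance opacity; the first Gaussian is treated
                               # as mislabeled and made transparent in the mask
instance_id = [7, 7, 3]

pixel = composite(gray, sigma)
mask_7 = composite([1.0 if i == 7 else 0.0 for i in instance_id], sigma_star)
```

In this toy ray the first Gaussian still contributes to `pixel` through `sigma`, while its zero `sigma_star` removes it from the mask of instance 7, which mirrors the decoupling of visual existence from instance occupancy described above.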

18. 【2605.20035】Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Link: https://arxiv.org/abs/2605.20035

Authors: Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Omni-modal large language, large language models, unified audio-visual understanding, Omni-modal large, aligned token sequences

Comments: Code Link: [this https URL](https://github.com/xxayt/SEATS)

Abstract:Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
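
Score-based token pruning, the basic operation behind training-free token selection, can be sketched as a top-k filter that preserves token order. This is a generic illustration only: the function name, the scores, and the fixed `keep_ratio` are hypothetical, and SEATS's attention-weighted diversity selection and dynamic budget allocation are not reproduced here.

```python
def keep_top_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order.

    At least one token is always kept.
    """
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical query-relevance scores for five non-textual tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
kept = keep_top_tokens(scores, keep_ratio=0.4)
```

Returning the indices in their original order keeps the temporal alignment of the surviving audio and video tokens intact.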

19. 【2605.20033】A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

Link: https://arxiv.org/abs/2605.20033

Authors: Rohit Sinha, Kunal Tilaganji, Tanuja Ganu, Nagarajan Natarajan, Amit Sharma, Vineeth N. Balasubramanian

Categories: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)

Keywords: Multimodal large language, Multimodal large, generate reasoning chains, large language models, incorrect answers

Comments: ICLR 2026 Workshop VerifAI-2

Abstract:Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges' interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.

20. 【2605.19995】CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Link: https://arxiv.org/abs/2605.19995

Authors: Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent diffusion models, achieve strong photorealism, models achieve strong, Recent diffusion, clay render conditions

Comments: none

Abstract:Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and tuning these cues into dense reasoning output. Besides, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we release its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carry genuine creative intent rather than simulated ones. Experiments on two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: this https URL

21. 【2605.19990】Minimalist Visual Inertial Odometry

Link: https://arxiv.org/abs/2605.19990

Authors: Francesco Pasti, Jeremy Klotz, Nicola Bellotto, Shree K. Nayar

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: mobile robot navigation, number of pixels, critical to mobile, large number, Visual-Inertial Odometry

Comments: This work has been submitted to the IEEE for possible publication

Abstract:Visual-Inertial Odometry (VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded simulator. The resulting model decodes speed from just the four measurements produced by the photodiodes. Pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory. We validate our approach with a prototype sensor mounted on a differential-drive robot. Across diverse indoor and outdoor terrains, our system closely tracks the reference ground truth without any real-world fine-tuning. Our work shows that minimalist sensing enables efficient and accurate planar odometry.
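
Turning a forward-speed estimate and an IMU yaw rate into a planar trajectory is a standard dead-reckoning step. The sketch below uses simple Euler integration of the unicycle model and assumes a fixed time step; the function name and the sample values are hypothetical, and the paper's own integration scheme may differ.

```python
import math


def integrate_planar(speeds, yaw_rates, dt, pose=(0.0, 0.0, 0.0)):
    """Euler-integrate (speed, yaw rate) samples into (x, y, heading) poses."""
    x, y, theta = pose
    trajectory = [(x, y, theta)]
    for v, omega in zip(speeds, yaw_rates):
        x += v * math.cos(theta) * dt   # advance along the current heading
        y += v * math.sin(theta) * dt
        theta += omega * dt             # then update the heading from the gyro
        trajectory.append((x, y, theta))
    return trajectory


# Hypothetical samples: 1 m/s straight ahead for one second at 10 Hz.
straight = integrate_planar([1.0] * 10, [0.0] * 10, dt=0.1)
```

Because errors in speed and yaw rate accumulate through the integration, the quality of the decoded speed directly bounds how far such a trajectory can be trusted without external correction.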

22. 【2605.19986】Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

Link: https://arxiv.org/abs/2605.19986

Authors: He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao, Serge Belongie, Xin Geng, Yuxin Peng, Xiu-Shen Wei

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: constraint-respecting motor execution, global scene context, local attribute grounding, high-fidelity spatial perception, longer suffices

Comments: Project page: [this https URL](https://metafine.github.io/)

Abstract:Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.

23. 【2605.19982】InterLight: Leveraging Intrinsic Illumination Priors for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2605.19982

Authors: Ziqi Wang,Xu Zhang,Laibin Chang,Shi Chen,Jiaqi Ma,Huan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-Light Image Enhancement, Low-Light Image, low-level vision, low contrast, Image Enhancement

Comments: Accepted by IJCAI 2026. Code: [this https URL](https://github.com/House-yuyu/InterLight)

Abstract:Low-Light Image Enhancement (LLIE) has long been a challenging problem in low-level vision, as insufficient illumination often leads to low contrast, detail loss, and noise. Recent studies show that deep learning-based Retinex theory can effectively decouple illumination and reflectance. However, existing methods frequently suffer from over-enhancement or color distortion, and often assume uniform noise or ideal lighting. To address these limitations, we propose InterLight, a novel framework that systematically excavates and operationalizes intrinsic illumination priors for LLIE. Our core insight is that robust enhancement requires not just estimating illumination, but constructing an illumination-aware pipeline. We first inject sensor-level illumination-response priors via physics-guided augmentation, then represent the degradation through adaptive prompts conditioned on the scene's latent illumination state. This explicit representation directly guides a luminance-gated intrinsic memory mechanism to selectively compensate for information loss, prioritizing reconstruction in dark regions while preserving fidelity in bright ones. Finally, the entire process is regularized by a self-supervised consistency objective that distills illumination-invariant features. By deeply exploiting intrinsic illumination priors, our method achieves clearer textures and more visually coherent enhancement results. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach. Code is available at: this https URL.

24. 【2605.19976】RECIPE: Procedural Planning via Grounding in Instructional Video

Link: https://arxiv.org/abs/2605.19976

Authors: Luigi Seminara,Antonino Furnari,Lorenzo Torresani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: partial video context, model to generate, generate the remaining, procedure in natural, natural language

Abstract:Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.
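The abstract above describes using grounding quality as a reward for GRPO. A minimal sketch of the group-relative advantage step that GRPO is generally built on is shown below; the function name `group_relative_advantages` and the example scores are hypothetical, and how RECIPE actually computes its grounding reward is not specified here.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of plans sampled for the same prompt.

    Each plan's advantage is its reward minus the group mean, divided by the
    group standard deviation (eps avoids division by zero).
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Hypothetical grounding scores in [0, 1] for four plans sampled for one video.
grounding_scores = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(grounding_scores)
```

Plans that are better grounded than the group average receive a positive advantage and are reinforced; those below average are pushed down, so no clean step labels are needed in this step.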

25. 【2605.19974】SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion

Link: https://arxiv.org/abs/2605.19974

Authors: Antoine Schnepf,Karim Kassab,Flavian Vasile,Andrew Comport

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly prevalent, growing adoption, adoption of virtual, virtual reality, fully immersive

Comments: Accepted at ICML 2026. Project page available at [this https URL](https://sphericaldreamer.github.io)

Abstract:The generation of immersive and navigable 3D environments is increasingly prevalent with the growing adoption of virtual reality and 3D content. However, recent methods face a fundamental limitation: they cannot produce 3D worlds that simultaneously (i) are navigable over long-range spatial extents and (ii) cover the complete omnidirectional field of view ($360^\circ$ horizontally and $180^\circ$ vertically). To address this challenge, we introduce SphericalDreamer, a method for generating fully immersive and long-range 3D outdoor environments from textual prompts. Our approach is built on the generation of multiple panoramic images, which are subsequently lifted into 3D and fused together while maintaining visual and geometric consistency. SphericalDreamer produces highly detailed, fully immersive 3D environments, while substantially improving scale and navigability compared to prior approaches.

26. 【2605.19957】World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Link: https://arxiv.org/abs/2605.19957

Authors: Zuyao Lin,Jianhui Zhang,Peidong Jia,Xiaoguang Zhao,Shanghang Zhang,Xingyu Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: robot-centric instruction-conditioned dynamics, captures persistent instruction-agnostic, captures robot-centric instruction-conditioned, typically predict distinct, persistent instruction-agnostic scene

Abstract:World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

27. 【2605.19956】Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

Link: https://arxiv.org/abs/2605.19956

Authors: Jia-Wei Hai,Yijun Wang,Xiu-Shen Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved significant zero-shot, significant zero-shot performance, Vision-Language Models, achieved significant, significant zero-shot

Comments: Accepted by ICML 2026. Project page: this https URL. Code: this https URL

Abstract:Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Code is available at this https URL.

28. 【2605.19950】AffectVerse: Emotional World Models for Multimodal Affective Computing

Link: https://arxiv.org/abs/2605.19950

Authors: Bo Zhao,Fanghua Ye,Yixin Ji,Sicheng Zhao,Xiaojiang Peng,Zitong YU

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans infer emotions, Humans infer, Emotion World Module, Humans, integrating observed multimodal

Abstract:Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: (1) Cross-Modal Temporal Imagination, which predicts future video/audio representations from past tokens with multi-step rollout; (2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation, which compresses imagined tokens into modality-aware belief tokens; and (3) Belief Injection, which inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves by at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.

29. 【2605.19949】Feed-Forward Gaussian Splatting from Sparse Aerial Views

Link: https://arxiv.org/abs/2605.19949

Authors: Dongli Wu,Zhuoxiao Li,Tongyan Hua,Yinrui Ren,Xiaobao Wei,Rongjun Qin,Wufan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing large-scale urban, Reconstructing large-scale, sparse aerial, challenging task, crucial yet challenging

Abstract:Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

30. 【2605.19931】StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

Link: https://arxiv.org/abs/2605.19931

Authors: Reza M. Asiyabi,Juan Alberto Molina-Valero,The SEOSAW Partnership,Steven Hancock,Casey M. Ryan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Estimating forest aboveground, Earth observation combines, forest aboveground biomass, incompatible label sources, structurally incompatible label

Comments: 10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables

Abstract:Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.
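The Augmented IPW (AIPW) pseudo-outcome mentioned above can be sketched in its standard textbook form: the imputed value plus an inverse-propensity-weighted residual on observed samples. The NumPy code below is only an illustration under that assumption; the function name `aipw_pseudo_outcome` and the toy numbers are hypothetical, and StruMPL's exact loss is not reproduced here.

```python
import numpy as np


def aipw_pseudo_outcome(y, observed, propensity, imputed, eps=1e-6):
    """Standard AIPW pseudo-outcome: imputed + observed / propensity * (y - imputed).

    y          : labels (may be NaN where unobserved)
    observed   : 1 where a label exists, 0 otherwise
    propensity : estimated probability that the label is observed
    imputed    : imputation-head prediction used as the baseline
    """
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Clip the propensity away from zero so the weight stays bounded.
    propensity = np.clip(np.asarray(propensity, dtype=float), eps, 1.0)
    imputed = np.asarray(imputed, dtype=float)
    # Residual is only defined where a label was actually observed.
    residual = np.where(observed > 0, y - imputed, 0.0)
    # In an autodiff framework, `propensity` and `imputed` would be wrapped in a
    # stop-gradient here, as the abstract describes; NumPy has no gradients.
    return imputed + observed / propensity * residual


pseudo = aipw_pseudo_outcome(
    y=[10.0, float("nan")], observed=[1, 0], propensity=[0.5, 0.2], imputed=[4.0, 3.0]
)
```

For the observed pixel the residual of 6 is up-weighted by 1/0.5, while the unobserved pixel falls back to its imputed value.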

31. 【2605.19929】Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Link: https://arxiv.org/abs/2605.19929

Authors: Yi Zhong,Haotong Qin,Xindong Zhang,Lei Zhang,Guolei Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deploying Vision-Language Models, Vision-Language Models, Low-bit post-training quantization, Low-bit post-training, resource-constrained devices

Abstract:Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at this https URL.
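The observation that a few outlier channels dominate the quantization range can be illustrated with a toy NumPy experiment. This is not SplitQ's MOCD module: `fake_quant`, the 16-channel activation matrix, and the choice of channel 3 as the outlier are all assumptions made purely for illustration.

```python
import numpy as np


def fake_quant(x, bits):
    """Symmetric uniform quantize-dequantize with a single scale for all of x."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(256, 16))
acts[:, 3] *= 50.0  # hypothetical outlier channel

normal = np.delete(np.arange(16), 3)  # indices of the non-outlier channels

# One shared 4-bit scale: the outlier channel stretches the range for everyone.
joint = fake_quant(acts, 4)

# Outlier channel quantized separately from the remaining channels.
split = acts.copy()
split[:, normal] = fake_quant(acts[:, normal], 4)
split[:, 3] = fake_quant(acts[:, 3], 4)

err_joint = np.mean((joint[:, normal] - acts[:, normal]) ** 2)
err_split = np.mean((split[:, normal] - acts[:, normal]) ** 2)
```

With a shared scale the non-outlier channels are rounded very coarsely, so `err_joint` is far larger than `err_split`; isolating the outlier channel restores resolution for the rest.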

32. 【2605.19890】GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

Link: https://arxiv.org/abs/2605.19890

Authors: Shyma Alhuwaider,Yasmeen Alsaedy,Merey Ramazanova,Silvio Giancola,Bernard Ghanem

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: TTA, unlabeled test stream, enables a pre-trained, distribution shift, memory

Abstract:Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.
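The idea of combining class-balanced allocation with feature-space diversity can be sketched as a small buffer. The class below is a hypothetical illustration, not one of the GOTTA policies: the name `DiverseClassBalancedMemory`, the fixed per-class quota, and the "maximize the minimum pairwise distance" replacement rule are all assumptions.

```python
import numpy as np


class DiverseClassBalancedMemory:
    """Fixed quota per (pseudo-)class; replacements must increase diversity."""

    def __init__(self, num_classes, per_class):
        self.per_class = per_class
        self.slots = {c: [] for c in range(num_classes)}

    @staticmethod
    def _min_pairwise(feats):
        # Smallest distance between any two stored features (inf if only one).
        feats = np.stack(feats)
        d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
        d[np.diag_indices(len(feats))] = np.inf
        return d.min()

    def add(self, feature, pseudo_label):
        """Return True if the sample was stored, False if it was rejected."""
        feature = np.asarray(feature, dtype=float)
        slot = self.slots[pseudo_label]
        if len(slot) < self.per_class:
            slot.append(feature)
            return True
        # Try swapping the new sample for each stored one; keep the best swap
        # only if it makes the slot more spread out than it is now.
        best_score, best_idx = self._min_pairwise(slot), None
        for i in range(len(slot)):
            candidate = slot[:i] + slot[i + 1:] + [feature]
            score = self._min_pairwise(candidate)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:
            return False
        slot[best_idx] = feature
        return True


memory = DiverseClassBalancedMemory(num_classes=2, per_class=2)
memory.add([0.0, 0.0], 0)
memory.add([0.1, 0.0], 0)
swapped = memory.add([5.0, 0.0], 0)    # far away: replaces a near-duplicate
rejected = memory.add([0.05, 0.0], 0)  # redundant: would reduce diversity
```

Under a temporally correlated stream, near-duplicate frames of the same class are rejected rather than filling the buffer, which is the redundancy problem the abstract highlights.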

33. 【2605.19889】GLUT: 3D Gaussian Lookup Table for Continuous Color Transformation

Link: https://arxiv.org/abs/2605.19889

Authors: Danna Xue,David Serrano-Lozano,Shaolin Su,Javier Vazquez-Corral

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Lookup Tables, storing large numbers, RGB space, discretizing the RGB, grid-based representation requires

Comments: Project page: [this https URL](https://color.cvc.uab.cat/glut/)

Abstract:3D Lookup Tables (3D LUTs) are widely used for color mapping, but their grid-based representation requires discretizing the RGB space, leading to a capacity-memory trade-off that becomes prohibitive when storing large numbers of LUTs. Recent approaches adopt implicit neural representations to improve scalability, yet their black-box nature limits interpretability and hinders intuitive, localized editing. In this paper, we propose Gaussian LUT (GLUT), a continuous and explicit color representation that models color transformations using a set of learnable 3D Gaussian primitives. By avoiding fixed-resolution grids, GLUT achieves flexible representational capacity while maintaining a compact memory footprint. Its explicit, spatially localized formulation further enables both accurate modeling and interpretability. Building on this representation, we introduce a compact conditional generator (CGLUT) that predicts GLUT parameters for multiple LUT instances, encoding diverse color styles in a single framework to enable smooth and controllable LUT style blending. Moreover, GLUT supports efficient, user-friendly editing by allowing localized adjustments to specific color regions without global retraining. Experimental results demonstrate that our approach outperforms prior neural LUT representations in both accuracy and efficiency, while offering improved interpretability and interactive control.
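One way to picture a continuous, explicit color transform built from 3D Gaussian primitives is as a sum of localized color offsets. The sketch below assumes isotropic Gaussians in RGB space that each add a weighted offset; this parameterization, and the names `apply_gaussian_lut`, `means`, `sigmas`, and `offsets`, are illustrative assumptions rather than the GLUT formulation.

```python
import numpy as np


def apply_gaussian_lut(rgb, means, sigmas, offsets):
    """Shift each color by Gaussian-weighted offsets.

    rgb     : (N, 3) input colors in [0, 1]
    means   : (K, 3) Gaussian centers in RGB space
    sigmas  : (K,)   isotropic standard deviations
    offsets : (K, 3) color shift applied at full strength at each center
    """
    rgb = np.asarray(rgb, dtype=float)
    diff = rgb[:, None, :] - means[None, :, :]                 # (N, K, 3)
    sq_dist = np.sum(diff ** 2, axis=-1)                       # (N, K)
    weights = np.exp(-0.5 * sq_dist / sigmas[None, :] ** 2)    # (N, K)
    return np.clip(rgb + weights @ offsets, 0.0, 1.0)


# A single primitive centered on pure red that desaturates reds slightly.
means = np.array([[1.0, 0.0, 0.0]])
sigmas = np.array([0.1])
offsets = np.array([[-0.2, 0.1, 0.0]])

colors = np.array([[1.0, 0.0, 0.0],   # pure red: fully affected
                   [0.0, 0.0, 1.0]])  # pure blue: far from the primitive
mapped = apply_gaussian_lut(colors, means, sigmas, offsets)
```

Because each primitive only influences colors near its center, editing one primitive's offset changes a local region of color space, which matches the localized editing the abstract describes.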

34. 【2605.19876】Structural Energy Guidance for View-Consistent Text-to-3D Generation

Link: https://arxiv.org/abs/2605.19876

Authors: Qing Zhang,Jinguang Tong,Jing Zhang,Jie Hong,Xuesong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leading to inconsistent, models often suffers, inconsistent geometry, Structural Energy-Guided Sampling, Janus problem

Comments: arXiv admin note: substantial text overlap with [arXiv:2508.16917](https://arxiv.org/abs/2508.16917)

Abstract:Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

35. 【2605.19869】Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

Link: https://arxiv.org/abs/2605.19869

Authors: Ananth Sriram,Neel Mokaria,Rajveer Singh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deadliest industry sector, United States, fatal worker injuries, worker injuries recorded, majority preventable

Comments: 10 pages, 4 figures. First place, [Ironsite.ai](http://Ironsite.ai) Spatial Intelligence Hackathon, University of Maryland, February 2026. Code available at [this https URL](https://github.com/ananthsriram1/ironsite-hackathon-project-safety_assistant)

Abstract:Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

36. 【2605.19868】WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation

Link: https://arxiv.org/abs/2605.19868

Authors: Muhammad Ashad Kabir,Rabin Dulal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: monitor healing progression, diabetic foot ulcers, pressure injuries require, injuries require accurate, require accurate tissue-level

Comments: 10 pages

Abstract:Chronic wounds such as diabetic foot ulcers and pressure injuries require accurate tissue-level assessment to guide treatment planning and monitor healing progression. While deep learning methods have advanced automated wound analysis, most existing approaches focus on binary segmentation and inadequately model heterogeneous tissue composition due to high intra-class variability and limited annotated data. Multi-class wound tissue segmentation, therefore, remains a challenging and clinically relevant problem. We propose WoundFormer, a transformer-based framework that enhances hierarchical spatial feature fusion for multi-class wound tissue segmentation. Specifically, we replace the standard SegFormer decoder with a spatially-preserving multi-scale aggregation head that maintains feature topology during cross-scale integration and strengthens contextual interactions through convolutional fusion. This design improves boundary localization and discrimination between visually similar tissue categories while preserving transformer efficiency. We evaluate WoundFormer on the WoundTissueSeg dataset (147 images, six tissue classes) and a second benchmark (DFUTissue dataset). The proposed method achieves an overall Dice score of 81.9%, outperforming strong CNN- and transformer-based baselines by up to 4.3 Dice points on the WoundTissueSeg benchmark, with consistent improvements across minority tissue classes. These results indicate that explicit modeling of hierarchical spatial interactions enhances transformer representations for heterogeneous wound tissue segmentation and supports more reliable quantitative wound assessment.

37. 【2605.19866】Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

Link: https://arxiv.org/abs/2605.19866

Authors: Peter El Hachem,Ahmed Nassar,A. Said Gurbuz,Christoph Auer,Peter W. J. Staar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: parse documents, frequently break, enclosing layout entity, Hop, Vision-Language Models

Comments: 18 pages, 7 figures. Main text: 9 pages (4 figures); Appendix: 9 pages (3 figures)

Abstract:Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder's generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from $0.37$ to $0.92$; on the Chinese subset of OmniDocBench, table TEDS rises from $0.01$ to $0.36$; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost $15\%$ wall-clock latency and a median of $74$ prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.

38. 【2605.19865】Landscape-Awareness for Geometric View Diffusion Model

Link: https://arxiv.org/abs/2605.19865

Authors: Yan-Ting Chen,Hao-Wei Chen,Tsu-Ching Hsiao,Chun-Yi Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate camera viewpoint, conditions remains challenging, sparse-view conditions remains, Accurate camera, remains challenging

Comments: CVPR 2026

Abstract:Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, showing promising results when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from nonconvex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample-efficiency.

39. 【2605.19859】Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Link: https://arxiv.org/abs/2605.19859

Authors: Hengfei Wang,Anshul Gupta,Pierre Vuillecard,Jean-Marc Odobez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: general-purpose multimodal reasoners, strong zero-shot generalization, Vision-language models, gaze, gaze understanding

Comments: Under review

Abstract:Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.

40. 【2605.19855】A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

链接https://arxiv.org/abs/2605.19855

作者:Giacomo Astolfi,Matteo Bianchi,Riccardo Campi,Antonio De Santis,Marco Brambilla

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Explainable Artificial Intelligence, Concept-based Explainable Artificial, Artificial Intelligence, Explainable Artificial, interprets deep learning

备注: G. Astolfi, M. Bianchi, and R. Campi contributed equally

点击查看摘要

Abstract:Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra-similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept-based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero-shot pipelines in model analyses. The resulting dataset is available at this https URL.

41. 【2605.19846】FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

链接https://arxiv.org/abs/2605.19846

作者:Gueter Josmy Faure,Min-Hung Chen,Jia-Fong Yeh,Hung-Ting Su,Winston H. Hsu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:demonstrated remarkable capabilities, real-world applications requiring, applications requiring nuanced, requiring nuanced interpretation, fine-grained comprehension crucial

备注: CVPR'26 (Workshop on Video Large Language Models)

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.

42. 【2605.19839】When Preference Labels Fall Short: Aligning Diffusion Models from Real Data

链接https://arxiv.org/abs/2605.19839

作者:Weiyan Chen,Weijian Deng,Yao Xiao,Weijie Tu,ZiYi Dong,Ibrahim Radwan,Liang Lin,Pengxu Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:guide generative models, Preference alignment aims, aims to guide, guide generative, learning from comparisons

备注: ICML 2026 Camera Ready; Project Page: [this https URL](https://cwyxx.github.io/RealAlign)

点击查看摘要

Abstract:Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at this https URL.

43. 【2605.19837】CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

链接https://arxiv.org/abs/2605.19837

作者:Sherif Khairy,Catherine M. Elias

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

关键词:degrades camera-based object, Adverse weather, degrades camera-based, autonomous vehicles, camera-based object detection

备注

点击查看摘要

Abstract:Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.

44. 【2605.19824】From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

链接https://arxiv.org/abs/2605.19824

作者:Ahmed Y. Gado,Omar Y. Goba,Alaa Hassanein,Catherine M. Elias,Ahmed Hussein

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Large Language Models, Large Multimodal Models, Language Models, Multimodal Models, Autonomous Vehicles

备注

点击查看摘要

Abstract:Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

45. 【2605.19821】LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

链接https://arxiv.org/abs/2605.19821

作者:Jiaxin Wang,Muwei Jian,Hui Yu,Junyu Dong,Yifan Xia

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Facial Expression Recognition, Expression Recognition, Facial Expression, variations in pose, challenging due

备注

点击查看摘要

Abstract:Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at this https URL.

46. 【2605.19804】Stitched Value Model for Diffusion Alignment

链接https://arxiv.org/abs/2605.19804

作者:Hyojun Go,Hyungjin Chung,Prune Truong,Goutam Bhat,Li Mi,Zhaochong An,Zixiang Zhao,Dominik Narnhofer,Serge Belongie,Federico Tombari,Konrad Schindler

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:flow-based generative models, Monte Carlo estimates, Monte Carlo, aesthetic preference, flow-based generative

备注: Project page: [this https URL](https://gohyojun15.github.io/StitchVM/)

点击查看摘要

Abstract:For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.

47. 【2605.19799】Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

链接https://arxiv.org/abs/2605.19799

作者:Tonghao Zhuang(1),Shanglong Hu(1),Yongsheng Luo(1),Zhiqi Zhang(1),Yu Li(1) ((1) Zhuhai College of Science and Technology, Zhuhai, China)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:cardiac ultrasound images, present a semi-supervised, semi-supervised framework, framework for joint, Normalized Surface Distance

备注: Accepted to the ISBI 2026 Fetal HearT UltraSound Segmentation and Diagnosis (FETUS) Challenge

点击查看摘要

Abstract:We present a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi-task backbone, our method integrates SAM-Med2D for boundary refinement and leverages DINOv3 to enhance pseudo-label quality. We introduce view-specific hard masking along with a two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient at 79.99%, Normalized Surface Distance at 61.62%, and F1-score at 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: this https URL.

48. 【2605.19797】Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

链接https://arxiv.org/abs/2605.19797

作者:Viktor Kocur,Sithu Aung,Gabrielle Flood,Yaqing Ding,Lukas Bujnak,Torsten Sattler,Zuzana Kukelova

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:increasingly powerful models, recent years, improved significantly, significantly in recent, powerful models

备注

点击查看摘要

Abstract:Monocular depth estimation has improved significantly in recent years, driven by increasingly powerful models and large-scale training data. Predicted depth is increasingly used as an input signal for downstream tasks such as Structure-from-Motion (SfM), visual localization, and SLAM. However, monocular depth estimators (MDEs) are still primarily evaluated in terms of depth accuracy. Standard metrics aggregate errors globally and may not reflect the usefulness of depth for downstream geometric tasks. We therefore propose Depth2Pose, a framework for evaluating MDEs in the context of downstream tasks. By combining depth predictions with feature correspondences in depth-aware geometric solvers, we use relative camera pose estimation accuracy as a task-driven proxy for depth quality. Traditional benchmarks require dense ground truth in the form of per-pixel depth, which is expensive to obtain. In contrast, our formulation requires only camera poses, which can be estimated efficiently, e.g., using Structure-from-Motion pipelines. As a result, our framework can be applied to scenes where ground-truth depth is difficult to obtain, for example due to large scene scale or heavy occlusions (e.g., vegetated environments). Leveraging this, we introduce the D2P dataset, which contains challenging scenes outside the distribution of commonly used training data. We show that methods performing well under standard depth error metrics on existing benchmarks also perform well under our pose-based metric when evaluated on the same datasets, but do not necessarily generalize to our more challenging dataset. Finally, we provide a simple and extensible evaluation framework. The dataset and code are available at this http URL.

49. 【2605.19792】Mechanisms of Object Localization in Vision-Language Models

链接https://arxiv.org/abs/2605.19792

作者:Timothy Schaumlöffel,Martina G. Vilas,Gemma Roig

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visually-grounded language models, Visually-grounded language, textual information, highly effective, effective in linking

备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

50. 【2605.19786】Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

链接https://arxiv.org/abs/2605.19786

作者:Dvir Samuel,Yuval Atzmon,Gal Chechik,Yoni Kasten

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:methods remain slow, existing methods remain, remain slow, recently emerged, powerful paradigm

备注: [this https URL](https://research.nvidia.com/labs/par/fast4dmesh/)

点击查看摘要

Abstract:4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a $13\times$ speedup while producing higher-quality results. Moreover, our approach scales to videos up to $16\times$ longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

51. 【2605.19776】Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

链接https://arxiv.org/abs/2605.19776

作者:Yuanpei Zhao,Jie Lin,Chao Zhang,Yilin Wang,Mao Li,Chenhui Li,Jie Hou,Tangjie Lv

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dominant annotation protocols, existing benchmarks adopt, image aesthetic assessment, leaving their complementarity, controlled conditions

备注: 27 pages, 7 pages

点击查看摘要

Abstract:Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
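
The abstract says pairwise judgments are converted into pseudo-scores via an Elo reference pool but does not give the procedure. As a rough illustration only, the sketch below applies the standard Elo update to a list of (winner, loser) judgments. The function names, the K-factor of 32, and the starting rating of 1500 are assumptions chosen for the example, not values from the paper.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_pseudo_scores(items, judgments, k=32.0, initial=1500.0):
    """Turn (winner, loser) pairwise judgments into per-item ratings.

    k and initial are illustrative defaults (assumptions), not paper values.
    """
    ratings = {item: initial for item in items}
    for winner, loser in judgments:
        # The winner gains what the loser gives up, so the rating sum is conserved.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


# Hypothetical judgments over three images: a beats b, a beats c, b beats c.
scores = elo_pseudo_scores(["a", "b", "c"], [("a", "b"), ("a", "c"), ("b", "c")])
```

The ratings give an ordering from preferences alone. Mapping them onto an absolute score scale would need a separate calibration step, which this sketch leaves out.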

52. 【2605.19771】Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives

链接https://arxiv.org/abs/2605.19771

作者:Junli Wang,Zhihua Hua,Xueyi Liu,Zebin Xing,Haochen Tian,Kun Ma,Hangjun Ye,Guang Chen,Long Chen,Qichao Zhang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:minimizing geometric deviations, Existing imitation learning, Existing imitation, minimizing geometric, geometric deviations

备注

点击查看摘要

Abstract:Existing imitation learning methods for end-to-end autonomous driving predominantly learn from successful demonstrations by minimizing geometric deviations from expert trajectories. This paradigm implicitly assumes that spatial proximity implies behavioral safety, leading to a critical objective mismatch: trajectories with nearly identical imitation losses may exhibit drastically different safety outcomes, where one remains recoverable while the other results in collision. To address this limitation, we propose BeyondDrive, a failure-aware imitation learning framework that jointly learns from successful and failed driving behaviors. First, we introduce a flow matching-based negative trajectory generator that synthesizes safety-critical yet expert-proximate trajectories, enabling explicit modeling of safety asymmetry. Second, we develop a diversity-aware sampling strategy that mitigates mode collapse and improves coverage of diverse failure modes during negative trajectory generation. Third, we propose a Repulsive Distance Loss that simultaneously attracts predictions toward expert demonstrations while repelling them from hard negative trajectories, thereby establishing discriminative safety boundaries in trajectory space. Applied to the uni-modal baseline Latent TransFuser, BeyondDrive achieves 89.7 PDMS on the NAVSIMv1 closed-loop benchmark, outperforming prior state-of-the-art methods. Moreover, BeyondDrive generalizes effectively across different autonomous driving architectures, including multi-modal planners, and further demonstrates strong zero-shot transferability on the HUGSIM benchmark.
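
The abstract describes a loss that pulls predictions toward expert demonstrations and pushes them away from hard negative trajectories, but gives no formula. The sketch below shows one common way to write such an objective: a distance term to the expert plus a hinge penalty for being within a margin of a negative. This form, the function names, and the `margin` and `weight` parameters are assumptions for illustration and should not be read as the paper's Repulsive Distance Loss.

```python
import math


def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def attract_repel_loss(pred, expert, negatives, margin=1.0, weight=1.0):
    """Hypothetical attract/repel objective (illustrative form, not from the paper)."""
    attract = trajectory_distance(pred, expert)
    # Hinge term: only negatives closer than `margin` contribute a penalty.
    penalties = [max(0.0, margin - trajectory_distance(pred, n)) for n in negatives]
    repel = sum(penalties) / len(penalties) if penalties else 0.0
    return attract + weight * repel


expert = [(0.0, 0.0), (1.0, 0.0)]
near_negative = [(0.0, 0.5), (1.0, 0.5)]   # close to the expert, assumed unsafe
far_negative = [(0.0, 5.0), (1.0, 5.0)]    # far away, outside the margin
loss_near = attract_repel_loss(expert, expert, [near_negative])
loss_far = attract_repel_loss(expert, expert, [far_negative])
```

With the prediction equal to the expert trajectory, the attract term is zero, so the loss reflects only how close the prediction sits to the negatives.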

53. 【2605.19750】CPC-VAR: Continual Personalized and Compositional Generation in Visual Autoregressive Models

链接https://arxiv.org/abs/2605.19750

作者:Junhao Li,Xinhao Zhong,Yi Sun,Yuxia Qiao,Bin Chen,Shu-Tao Xia,Yaowei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual autoregressive, recently emerged, efficient paradigm, Visual, VAR

备注

点击查看摘要

Abstract:Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

54. 【2605.19744】Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection

链接https://arxiv.org/abs/2605.19744

作者:Albert Schotschneider,Daniel Bogdoll,Svetlana Pavlitska,Ahmed Abouelazm,Johann Marius Zoellner

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:data remains challenging, collecting representative anomalous, representative anomalous data, anomalous data remains, Detecting anomalies

备注: Accepted at CVPR 2026 Workshop AUTOPILOT-NA

点击查看摘要

Abstract:Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.
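
The core idea in the abstract, scoring patches by nearest-neighbor similarity to the embeddings of a single reference image, can be sketched in a few lines. The example below assumes patch embeddings are already available as plain lists of floats; the embedding model, the threshold of 0.5, and all function names are assumptions for illustration, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def anomaly_scores(query_patches, reference_patches):
    """Per-patch score: 1 minus the best cosine similarity to any reference patch."""
    return [1.0 - max(cosine(q, r) for r in reference_patches) for q in query_patches]


def anomaly_mask(scores, grid_width, threshold=0.5):
    """Reshape flat patch scores into rows of booleans (True = anomalous)."""
    flags = [s > threshold for s in scores]
    return [flags[i:i + grid_width] for i in range(0, len(flags), grid_width)]


# Toy 2-D "embeddings": two reference patches and a 2x2 grid of query patches.
reference = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [1.0, 1.0]]
scores = anomaly_scores(query, reference)
mask = anomaly_mask(scores, grid_width=2)
```

Because the score depends only on the closest reference patch, a query patch is flagged only when nothing in the reference image resembles it.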

55. 【2605.19739】FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

链接https://arxiv.org/abs/2605.19739

作者:Yi Sun,Zhiqi Zhang,Xinhao Zhong,Yimin Zhou,Shuoyang Sun,Bin Chen,Shu-Tao Xia,Ke Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:growing safety risks, safety risks due, Recent advances, flow matching models, introduce growing safety

备注

点击查看摘要

Abstract:Recent advances in flow matching models have significantly improved text-to-image generation quality, but also introduce growing safety risks due to the generation of harmful or undesirable content. Existing concept erasure methods are either inference-time interventions with limited effectiveness or rely on supervised fine-tuning (SFT), which requires precisely aligned data and struggles with scalability and multi-concept settings. In this paper, we propose FlowErase-RL, the first GRPO-based framework for concept erasure in flow matching models. We reformulate concept erasure as a reward optimization problem and introduce a dynamic dual-path reward mechanism that jointly optimizes (i) a Concept Erasure (CE) reward to suppress target concepts and (ii) a Non-target Space (NS) reward to preserve generative fidelity. The two reward paths are adaptively balanced during training via a performance-driven switching strategy, enabling stable optimization without explicit supervision. Extensive experiments on nudity, object, and artistic style erasure demonstrate that our method achieves state-of-the-art erasure performance while maintaining strong image quality and semantic alignment. Moreover, it exhibits robust resistance to adversarial attacks and scales effectively to multi-concept scenarios. Our results establish a new paradigm for safe and controllable generation in flow matching models.

56. 【2605.19737】Decentralized Direct Volume Rendering: A Browser-Native GPU Architecture for MRI Digital Twins in Resource-Constrained Settings

Link: https://arxiv.org/abs/2605.19737

Authors: Oserebameh Augustine Beckley

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: technology holds immense, holds immense potential, technology holds, personalized medicine, Digital Twins

Notes: 10 pages, 4 figures. Live interactive browser demo available at: [this https URL](https://webgpu-mri.vercel.app/). Source code repository: [this https URL](https://github.com/Bahdmanbabzo/webgpu-mri)

Abstract:Digital Twin (DT) technology holds immense potential for surgical planning and personalized medicine. However, generating interactive, patient-specific anatomical twins currently relies on computationally heavy Server-Side Rendering (SSR) or expensive local workstations, creating significant barriers to deployment, especially in resource-constrained settings (RCS). This paper presents a decentralized, client-side WebGPU architecture that democratizes access to high-fidelity anatomical Digital Twins. By bypassing standard server-side rendering pipelines, the framework executes deterministic single-pass raymarching and morphological gradient calculations directly on low-cost integrated edge GPUs. Eliminating the network latency inherent to cloud-rendered solutions, the system achieves a Time to First Pixel (TTFP) of under 920.0 ms and maintains stable interactivity at = 82.0 FPS. Continuous Interaction Fidelity is maintained via uniform buffers, enabling zero-latency manipulation of tissue parameters for dynamic clinical decision-making. By proving that complex 3D medical simulations of patient-specific MRI scans can be executed natively in the browser without deep learning or external computational dependencies, this architecture provides a scalable, affordable foundation for the widespread clinical adoption of healthcare Digital Twins.
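
For readers unfamiliar with raymarching, here is a generic CPU sketch of front-to-back compositing along a single ray. It is a textbook formulation under assumed inputs (a list of `(intensity, opacity)` samples and a hypothetical `opacity_cutoff`), not the paper's WebGPU shader.

```python
def composite_ray(samples, opacity_cutoff=0.99):
    """Front-to-back alpha compositing of (intensity, opacity) samples along one ray."""
    color = 0.0
    alpha = 0.0
    for intensity, opacity in samples:
        # Each sample contributes only through the transparency left by earlier samples.
        weight = (1.0 - alpha) * opacity
        color += weight * intensity
        alpha += weight
        if alpha >= opacity_cutoff:
            break  # early ray termination: later samples would be almost invisible
    return color, alpha
```

Because every ray is processed in one pass with no dependence on other rays, this kind of loop maps naturally onto one GPU thread per pixel.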

57. 【2605.19734】GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

Link: https://arxiv.org/abs/2605.19734

Authors: Tiantong Fang,Xiuwei Wang,Jing Xiao,Wujie Zhou,Liang Liao,Mi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables complementary observation, retrieval remains challenging, Multi-source remote sensing, sensing enables complementary, Multi-source remote

Notes

Abstract:Multi-source remote sensing enables complementary observation of ground objects, while cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.

58. 【2605.19729】LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Link: https://arxiv.org/abs/2605.19729

Authors: Hyunsoo Han,Sangyeop Yeo,Jaejun Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: complex denoising process, substantially larger capacity, network highly complex, highly complex denoising, Adaptive Coefficient Estimation

Notes: 15 pages, 11 figures, 9 tables, To appear in CVPR 2026

Abstract:We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from its substantially larger capacity - poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

59. 【2605.19728】Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Link: https://arxiv.org/abs/2605.19728

Authors: Abdul Mohaimen Al Radi,Kunyang Li,Yuzhang Shang,Mubarak Shah,Yu Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce visually impressive, Foundation video models, visually impressive results, models produce visually, Foundation video

Notes

Abstract:Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

60. 【2605.19727】Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Link: https://arxiv.org/abs/2605.19727

Authors: Zebin He,Mingxin Yang,Shuhui Yang,Hanxiao Sun,Xintong Han,Chunchao Guo,Wenhan Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models typically align, typically align point, strong cross-modal retrieval, frozen vision-language spaces, achieve strong cross-modal

Notes

Abstract:Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

61. 【2605.19726】Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

Link: https://arxiv.org/abs/2605.19726

Authors: Wenhu Zhang,Yiming Wu,Huanyu Wang,Yaoyang Liu,Huanzhang Dou,Senqiao Yang,Sitong Wu,Hanbin Zhao,Jiaya Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enable globally coherent, traditional autoregressive LLMs, sequences remains costly, Diffusion Language Models, ultra-long sequences remains

Notes: CVPR 2026 Findings paper

Abstract:Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with block-wise pre-downsampled operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.
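
The general idea of choosing attention blocks in a downsampled space can be sketched as follows. This is a simplified illustration that assumes mean pooling and dot-product block scores; the helper names and the `keep` budget are hypothetical, and the paper's norm-sorting module and covariance-compensated correction are not modeled.

```python
def mean_pool(rows, block):
    """Average each consecutive group of `block` row vectors into a single vector."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(r[d] for r in chunk) / len(chunk) for d in range(dim)])
    return pooled

def select_key_blocks(queries, keys, block, keep):
    """For each pooled query block, return the indices of the `keep` highest-scoring key blocks."""
    q_blocks = mean_pool(queries, block)
    k_blocks = mean_pool(keys, block)
    selected = []
    for qb in q_blocks:
        scores = [sum(a * b for a, b in zip(qb, kb)) for kb in k_blocks]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(order[:keep]))
    return selected
```

Full attention would then be computed only inside the selected blocks, so the selection cost scales with the number of blocks rather than the number of tokens.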

62. 【2605.19717】Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

Link: https://arxiv.org/abs/2605.19717

Authors: Elias Berger,Muhammad Usama,Jan Mehlstäubl,Bernhard Saske,Kristin Paetzold-Byhain

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Language, Language Models, lack physical comprehension, physical comprehension required

Notes: Accepted in IJCAI-ECAI 2026 (Special Track on AI4Tech)

Abstract:Large Language Models (LLMs) can generate Computer-Aided Design (CAD), yet lack the physical comprehension required for reliable engineering design. Instead of attempting to implicitly learn physical laws from data, we propose a Hybrid Agentic-Physical Architecture that embeds validated knowledge-based engineering tools directly into the decision-making loop of autonomous AI agents. In this framework, engineering design is formulated as a closed-loop, sequential decision-making process guided by explicit physical verification. Based on a load case, dedicated agents iteratively plan, generate, evaluate, and revise engineering designs using knowledge-based tools as a feedback signal. We introduce a benchmark dataset and metrics for assessing functional validity in generative CAD. Our system generates more complex and physically verified designs, with a 4.2 increase in structural complexity and a 3.5% improvement in compile rate compared to similar agentic methods. The codebase, prompts and dataset will be made publicly available to support reproducibility and future research.

63. 【2605.19712】Physics-informed simulation framework for realistic sonar image generation and statistical validation

Link: https://arxiv.org/abs/2605.19712

Authors: Kamal Basha S,Athira Nambiar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: costly real-world acquisition, utility remains limited, rigorous quantitative validation, real-world acquisition, offer a scalable

Notes

Abstract:Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance. Results show strong texture alignment (KL 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.
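
The three distribution measures named in this abstract can be computed on a pair of histograms as in the sketch below. It assumes 1D histograms over the same unit-spaced bins (for example intensity or LBP code counts); the `eps` smoothing and the function names are illustrative choices, not taken from ACOUSIM.

```python
import math

def normalize(hist, eps=1e-12):
    """Turn raw bin counts into a probability vector, smoothing to avoid log(0)."""
    total = sum(hist)
    return [(h + eps) / (total + eps * len(hist)) for h in hist]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def emd_1d(p, q):
    """Earth Mover's Distance for 1D histograms: sum of absolute CDF differences."""
    running, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        running += pi - qi
        total += abs(running)
    return total
```

KL is asymmetric and unbounded, JS is symmetric and bounded, and EMD accounts for how far mass must move between bins, so reporting all three gives complementary views of the same synthetic-versus-real gap.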

64. 【2605.19692】WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

Link: https://arxiv.org/abs/2605.19692

Authors: Satoshi Tsutsui,Winnie Pang,Shuting He,Bihan Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diagnosing blood disorders, white blood cells, plays a fundamental, leukemia and anemia, white blood

Notes: Accepted to Medical Image Analysis. arXiv admin note: substantial text overlap with [arXiv:2306.13531](https://arxiv.org/abs/2306.13531)

Abstract:The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available (this https URL).

65. 【2605.19688】DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

Link: https://arxiv.org/abs/2605.19688

Authors: Kylian Ronfleux-Corail(L3I),Guillaume Bernard(L3I),Mickaël Coustaty(L3I),Nicolas Sidère(L3I)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models achieve strong, achieve strong performance, manipulation localization models, localization models achieve, operational document workflows

Notes

Abstract:Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training - restricted to standard libjpeg quality factors - and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness - FFDN [2] and Mesorch [20] - each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at this https URL. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
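
To see why standard quality factors yield only a narrow family of quantization tables, the sketch below reproduces the usual libjpeg-style scaling of the baseline luminance table from the JPEG standard's Annex K. It covers only the Standard-QT side of the comparison; how DocQT tables are collected or sampled is not described in the abstract and is not modeled here. The function name is an illustrative choice.

```python
# Baseline luminance quantization table (JPEG standard, Annex K), row-major order.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def libjpeg_luma_table(quality):
    """Scale the baseline table by a quality factor in 1..100, clamping entries to 1..255."""
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [max(1, min(255, (value * scale + 50) // 100)) for value in BASE_LUMA]
```

Every table produced this way is a scalar multiple of one fixed base, so quality-factor augmentation can emit at most 100 distinct luminance tables, whereas cameras, scanners, and custom encoders may ship tables with entirely different shapes.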

66. 【2605.19656】Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Link: https://arxiv.org/abs/2605.19656

Authors: Matias Turkulainen,Akshay Krishnan,Filippo Aleotti,Mohamed Sayed,Guillermo Garcia-Hernando,Juho Kannala,Arno Solin,Gabriel Brostow,Daniyar Turmukhambetov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present Cross-View Splatter, pixel-aligned Gaussian splats, outdoor scenes captured, predicts pixel-aligned Gaussian, pixel-aligned Gaussian

Notes: Submitted to CVPR 2026. 8 figures, 3 tables. Project page: [this https URL](https://nianticspatial.github.io/cross-view-splatter/)

Abstract:We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level AND by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at this https URL.

67. 【2605.19649】CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

Link: https://arxiv.org/abs/2605.19649

Authors: Antoine Legrand,Renaud Detry,Christophe De Vleeschouwer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: estimation networks require, networks require tens, pose estimation networks, Spacecraft pose estimation, estimation networks

Notes: (under review)

Abstract:Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with a reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied to large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

68. 【2605.19639】Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

Link: https://arxiv.org/abs/2605.19639

Authors: Junjie Wang,Xinghua Lou,Jason Li,Ye Tian,Keyu Chen,Yulin Li,Bin Kang,Jacky Mai,Yanwei Li,Zhuotao Tian,Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unified Multimodal Models, Unified Multimodal, achieved remarkable progress, Multimodal Models, achieved remarkable

Notes

Abstract:Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at this https URL.

69. 【2605.19634】P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Link: https://arxiv.org/abs/2605.19634

Authors: Kai Sheng,Liuyi Wang,Haojie Dai,Jinlong Li,Yongrui Qin,Zongtao He,Chengju Liu,Qijun Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ground natural-language instructions, executable navigation actions, requires an embodied, unseen environments, embodied agent

Notes

Abstract:Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

70. 【2605.19631】HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

Link: https://arxiv.org/abs/2605.19631

Authors: Hoonhee Cho,Giwon Lee,Jae-Young Kang,Hyemin Yang,Heejun Park,Kuk-Jin Yoon

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: traditional modular pipelines, directly mapping raw, raw sensor data, mapping raw sensor, compelling alternative

Notes

Abstract:End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

71. 【2605.19624】Component-Aware Structure-Preserving Style Transfer for Satellite Sim2Real 6D Pose Estimation

Link: https://arxiv.org/abs/2605.19624

Authors: Yonglong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: non-cooperative satellites depends, satellites depends heavily, reliable pose labels, acquire at scale, depends heavily

Notes

Abstract:Monocular 6D pose estimation for non-cooperative satellites depends heavily on annotated training data, yet real satellite images with reliable pose labels and component-level masks are difficult to acquire at scale. Synthetic rendering can provide exact geometric annotations, but the appearance gap between rendered and real observations limits direct transfer to the real domain. This paper presents a component-aware structure-preserving style transfer framework for satellite synthetic-to-real data construction. The method builds weakly paired real-synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve satellite Sim2Real pose estimation in the considered calibrated setup while retaining simulation-derived geometric annotations.

72. 【2605.19623】PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

Link: https://arxiv.org/abs/2605.19623

Authors: Gabriele Rosi,Fabio Cermelli,Carlo Masone,Barbara Caputo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demands extensive pixel-level, extensive pixel-level annotations, Segmenting images, understanding but demands, demands extensive

Notes: CVPR 2026 Findings. Code: [this https URL](https://github.com/FocoosAI/PrAda)

Abstract:Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

73. 【2605.19622】UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

Link: https://arxiv.org/abs/2605.19622

Authors: Congpei Qiu,Zhaoyu Hu,Wei Ke,Zhuotao Tian,Yanhao Wu,Tong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spatially sensitive tasks, Vision Transformers, advanced rapidly, spatially sensitive, spurious tokens

Notes: CVPR 2026

点击查看摘要

Abstract: Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

74. 【2605.19620】Bézier Degradation Modeling for LiDAR-based Human Motion Capture

Link: https://arxiv.org/abs/2605.19620

Authors: Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurate motion reconstruction, driving and robotics, reconstruction is crucial, capture has broad, broad applications

Comments: Accepted by CVPR 2026

Abstract: LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks (LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D), BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.

75. 【2605.19613】White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

Link: https://arxiv.org/abs/2605.19613

Authors: Shuwei Li, Lei Tan, Robby T. Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object colors consistent, Color constancy, Color constancy aims, consistent under varying, Color

Comments: In CVPR 2026

Abstract:Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at this https URL.

76. 【2605.19611】Inverse Design of Metasurface based Absorbers using Physics Guided Conditional Diffusion Models

Link: https://arxiv.org/abs/2605.19611

Authors: Vineetha Joy, Jamshed Palai, Satwik Sahoo, Anshuman Kumar, Amit Sethi, Hema Singh

Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: specific electromagnetic responses, electromagnetic responses requires, responses requires generating, satisfy stringent spectral, stringent spectral constraints

Comments: none

Abstract:Inverse design of metasurfaces for specific electromagnetic responses requires generating geometries that satisfy stringent spectral constraints while maintaining manufacturability. Conventional design methodologies rely on iterative optimization routines using full wave simulations, which become extremely time consuming and computationally intensive for large design spaces. In addition, commonly employed generative approaches often exhibit limited conditional fidelity and the generated designs often contain fine or irregular features that are impractical to fabricate. In this regard, we propose a physics guided condition quality enhanced diffusion framework for the inverse design of metasurface based absorbers. Here, the conditioning information consisting of target reflection characteristics is integrated into the model using feature wise linear modulation (FiLM). Furthermore, to enforce adherence to target spectra, a pre trained surrogate EM simulator is embedded into the framework introducing physics aware regularization through spectrum level loss functions. The efficiency of the proposed model is demonstrated by generating practically realizable metasurfaces for different types of reflection characteristics in the frequency range of 2 to 18 GHz. The proposed framework achieves an average spectral mean squared error of 0.0006 and band alignment accuracy of 0.958 between the target spectra and the spectra produced by the generated designs, demonstrating high conditional accuracy. In addition, the model generates multiple geometries for the same condition, thereby providing diverse design alternatives to the engineer. The proposed model produces the suitable design in approximately 30 seconds, whereas the conventional approach can take several months under comparable computational resources. The efficiency of the model is also established via experimental measurements.

77. 【2605.19607】Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

Link: https://arxiv.org/abs/2605.19607

Authors: Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: desirable axiomatic properties, satisfies desirable axiomatic, widely adopted feature, Spectral Integrated Gradients, Integrated Gradients

Comments: 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

Abstract:Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at this https URL.
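As a rough illustration of the coarse-to-fine path described in the abstract, the path-integral form of IG and one possible SVD-based path can be written as follows. The general path-attribution formula and the SVD are standard; the piecewise-linear schedule that activates one singular component per segment is an assumption made here for illustration, not necessarily the construction used in the paper. Here $F$ is the model output, $x$ the input, $x'$ the baseline, and $r$ the rank of $x - x'$.

```latex
% Path-integrated attribution for feature i along a path \gamma with \gamma(0)=x', \gamma(1)=x
\mathrm{Attr}_i(x) = \int_0^1 \frac{\partial F(\gamma(\alpha))}{\partial \gamma_i(\alpha)}\,
                     \frac{d\gamma_i(\alpha)}{d\alpha}\, d\alpha

% Standard IG uses the straight line
\gamma_{\mathrm{IG}}(\alpha) = x' + \alpha\,(x - x')

% SVD of the baseline-to-input difference, singular values sorted in decreasing order
x - x' = \sum_{k=1}^{r} \sigma_k\, u_k v_k^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r

% Assumed coarse-to-fine path: on segment k, \alpha \in [(k-1)/r,\; k/r]
\gamma_{\mathrm{SIG}}(\alpha) = x' + \sum_{j<k} \sigma_j\, u_j v_j^{\top}
                              + \bigl(r\alpha - (k-1)\bigr)\, \sigma_k\, u_k v_k^{\top}
```

Under this schedule the path still starts at the baseline and ends at the input, so it remains a valid integration path; only the order in which structure is introduced changes.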

78. 【2605.19605】deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

Link: https://arxiv.org/abs/2605.19605

Authors: Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tree cover, tree, creating an urgent, tree mortality, worldwide are increasingly

Comments: Preprint. Under review. All rights reserved

Abstract:Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at this http URL.
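The "around 45% relative improvement" quoted for boreal forests follows directly from the two F1 scores in the abstract. A minimal check, where the helper name `relative_improvement` is introduced here only for illustration:

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (after - before) / before


# Boreal-forest F1 scores as quoted in the abstract.
f1_before, f1_after = 0.40, 0.58

absolute_gain = f1_after - f1_before
relative_gain = relative_improvement(f1_before, f1_after)

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.0%}")
```

The absolute gain is 0.18 F1 points; dividing by the 0.40 baseline gives the relative figure.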

79. 【2605.19595】A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

Link: https://arxiv.org/abs/2605.19595

Authors: João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: degraded insulation components, electrical power line, ensuring grid reliability, preventing failures caused, Unmanned Aerial Vehicles

Comments: none

Abstract:The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

80. 【2605.19578】Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Link: https://arxiv.org/abs/2605.19578

Authors: Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: RGB camera-based surveillance, camera-based surveillance systems, surveillance systems enable, systems enable human, RGB camera-based

Comments: 15 pages, 9 figures

Abstract:RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at this https URL.

81. 【2605.19559】EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

Link: https://arxiv.org/abs/2605.19559

Authors: Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Large Language, track object state, fine-grained hand-object interactions

Comments: none

Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability of MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graph (STSG)-guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance, and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct but cite evidence inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: this https URL.

82. 【2605.19556】EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

Link: https://arxiv.org/abs/2605.19556

Authors: Prateeth Rao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image pairs fundamentally, pairs fundamentally requires, geometrically consistent correspondences, Estimating relative pose, Estimating relative

Comments: 8 pages, 5 figures, in revision to be submitted to IEEE RA-L

Abstract:Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.

83. 【2605.19554】Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

Link: https://arxiv.org/abs/2605.19554

Authors: Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires synthesized images, Instilling creativity, significant challenge, presents a significant, requires synthesized

Comments: none

Abstract: Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generation, featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains each new image to align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.
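To give a sense of what a Kaiser-Bessel spatial weighting looks like, here is a minimal pure-Python sketch of a 2D window that emphasizes the image centre. It is not the paper's LSW module: the fixed `beta` (which the abstract suggests is learnable), the separable outer-product construction, and all function names are assumptions made for illustration.

```python
import math


def bessel_i0(x: float, terms: int = 30) -> float:
    """Zeroth-order modified Bessel function of the first kind, via its power series."""
    return sum((x / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))


def kaiser_window(length: int, beta: float) -> list:
    """1D Kaiser-Bessel window; a larger beta concentrates weight near the centre."""
    if length < 2:
        return [1.0] * length
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0  # position mapped to [-1, 1]
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / denom)
    return window


def kaiser_window_2d(size: int, beta: float) -> list:
    """Separable 2D window built as the outer product of two 1D windows (an assumption)."""
    w = kaiser_window(size, beta)
    return [[wi * wj for wj in w] for wi in w]


# Hypothetical 5x5 feature map weighting with a fixed shape parameter.
weights = kaiser_window_2d(5, beta=4.0)
for row in weights:
    print(" ".join(f"{value:.3f}" for value in row))
```

The centre weight is 1 and the weights fall off toward the borders; in a learnable variant, `beta` would be a trainable parameter controlling how sharply the centre is emphasized.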

84. 【2605.19551】AnchorFlow: Editable SVG Reconstruction via Sparse Anchor Point Fields

Link: https://arxiv.org/abs/2605.19551

Authors: Mengnan Jiang, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce vector graphics, easy to edit, aims to produce, produce vector, vector graphics

Comments: none

Abstract:Image-to-SVG reconstruction aims to produce vector graphics that are faithful to raster inputs and easy to edit. Existing methods face a structural trade-off in how vector structure is parameterized, including how many paths represent an image and how many anchor points define each path. High-fidelity methods often rely on many paths or densely parameterized curves, whereas overly compact SVG generation may deviate from the input geometry. This issue becomes more pronounced when local raster evidence is imperfect, where boundary-following reconstruction can introduce redundant anchors and fragmented structures. We argue that this trade-off should be addressed at the level of anchor placement, since anchors on Bezier curves define local path structure and strongly affect both accuracy and editability. We propose AnchorFlow, an editable SVG reconstruction framework that models path-level anchor placement with sparse anchor point fields. Given path-like foreground components extracted from a raster image, AnchorFlow predicts an image-conditioned sparse anchor field for each component and resolves it into an ordered Bezier path. Rendering-guided feedback then corrects local structural errors before re-resolution. The recovered paths are then assembled and optimized into the final SVG. Experiments on isolated paths and full images show that AnchorFlow achieves a favorable fidelity-editability trade-off, substantially reducing editable complexity while preserving competitive raster fidelity.

85. 【2605.19539】Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

Link: https://arxiv.org/abs/2605.19539

Authors: Zihao Zhu, Wenyuan Zhao, Nuo Chen, Chao Tian, Zhiwen Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: foundation models hold, models hold promise, unconstrained dense geometry, dense geometry prediction, Geometric foundation models

Comments: Accepted at ICML 2026. 10 pages main paper, with appendix

Abstract:Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at this https URL.

86. 【2605.19538】CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

Link: https://arxiv.org/abs/2605.19538

Authors: Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen, Baotian Hu, Longyue Wang, Weihua Luo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: human verification mechanisms, frequently block intelligent, block intelligent agents, real-world web environments, agents from completing

Comments: 17 pages, 12 figures

Abstract:CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.

87. 【2605.19533】Replacement Learning: Training Neural Networks with Fewer Parameters

Link: https://arxiv.org/abs/2605.19533

Authors: Yuming Zhang, Peizhe Wang, Tianyang Han, Hengyu Shi, Junhao Su, Dongzhi Guan, Jiabin Liu, Jiaji Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, models grow deeper, optimizing deep neural, full-depth backpropagation remains, neural networks

Comments: 16 pages

Abstract:End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.

88. 【2605.19532】Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

Link: https://arxiv.org/abs/2605.19532

Authors: Yunzhe Zhang, Hongfu Liu, Pengyu Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: synthesize high-quality images, yield large variations, high-quality images, models can synthesize, synthesize high-quality

Comments: Preprint

Abstract:Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this "seed effect" and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.
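The rank-then-keep-top-k step described in the abstract can be sketched in a few lines. This is only an illustration of the selection logic: the scores below are made-up numbers standing in for whatever core-token attention statistic the method computes during the first denoising steps, and `select_top_k_seeds` is a hypothetical helper, not the authors' code.

```python
def select_top_k_seeds(scores: dict, k: int) -> list:
    """Return the k seeds with the highest scores, best first.

    No accept/reject threshold is used: seeds are only compared with each other.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scores, key=lambda seed: scores[seed], reverse=True)
    return ranked[:k]


# Hypothetical per-seed scores for one prompt (seed -> early attention score).
early_attention_scores = {11: 0.42, 23: 0.77, 37: 0.31, 58: 0.65}

kept = select_top_k_seeds(early_attention_scores, k=2)
discarded = [seed for seed in early_attention_scores if seed not in kept]

print("run full generation for seeds:", kept)
print("skip seeds:", discarded)
```

Because only the relative order matters, the same routine works regardless of the scale of the scores, which is consistent with the abstract's point about not relying on a fixed threshold.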

89. 【2605.19528】Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

Link: https://arxiv.org/abs/2605.19528

Authors: Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, localization in Multimodal

Comments: none

Abstract:3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
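The back-projection equation quoted in the abstract, $\hat{X} = (u_c - c_x)\bar{Z}/f_x$, can be evaluated directly to see why the same pixel implies different 3D positions under different cameras. In the sketch below all numbers are hypothetical, $\bar{Z}$ is read as the mean of a few sampled depths (an assumption), and only the focal length is rescaled while the pixel and principal point stay fixed.

```python
from statistics import mean


def back_project_x(u_c: float, c_x: float, f_x: float, z_bar: float) -> float:
    """Pinhole back-projection of a pixel column to a metric X coordinate."""
    return (u_c - c_x) * z_bar / f_x


# Hypothetical tool outputs: multi-point metric depths (metres) and intrinsics.
sampled_depths_m = [4.9, 5.0, 5.1]
z_bar = mean(sampled_depths_m)      # assumed meaning of Z-bar: average sampled depth
u_c, c_x = 800.0, 640.0             # object-centre pixel column and principal point
base_f_x = 1000.0                   # focal length in pixels at the training scale

x_by_scale = {}
for scale in (0.5, 1.0, 1.5):
    x_by_scale[scale] = back_project_x(u_c, c_x, base_f_x * scale, z_bar)
    print(f"f_x scale {scale:.1f}: X = {x_by_scale[scale]:.3f} m")
```

With the pixel and depth held fixed, halving the focal length doubles the recovered X, which is the kind of intrinsic ambiguity the abstract argues must be resolved by substituting the actual camera parameters into the formula.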

90. 【2605.19527】Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

Link: https://arxiv.org/abs/2605.19527

Authors: Zhangjian Ji, Shaotong Qiao, Kai Feng, Wei Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiple camera views, matching partially visible, Dual Prompt Learning, person re-identification focuses, partially visible pedestrians

Comments

Abstract:Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models focus only on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Building on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which uses textual cues to capture complete pedestrian semantics while remaining robust to occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in the real world to enrich occluded samples. In addition, we design a Weighted Gated Feature Fusion (WGFF) method, which incorporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves state-of-the-art performance. The occlusion instance library is available at this https URL.

91. 【2605.19524】SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

Link: https://arxiv.org/abs/2605.19524

Authors: Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: safety-critical long-tail cases, driving systems excel, long-tail cases, systems excel, excel in common

Comments

Abstract:End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 testset, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.

92. 【2605.19523】Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

Link: https://arxiv.org/abs/2605.19523

Authors: Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: general multi-modal understanding, demonstrated remarkable proficiency, efficiently acquire continually, acquire continually evolving, continually evolving domain-specific

Comments

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.

93. 【2605.19522】iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

Link: https://arxiv.org/abs/2605.19522

Authors: Xinli Yue, JianHui Sun, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: professional photography requires, image quality assessment, Pairwise image quality, Answer Model, professional photography

Comments: Accepted to CVPR 2026 Workshop

Abstract:Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

94. 【2605.19511】Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

Link: https://arxiv.org/abs/2605.19511

Authors: Xiaodong Wu, Qi Li, Xiangman Li, Zelin Zhang, Lingshuang Liu, Jianbing Ni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: watermarked images remain, images remain editable, compromising watermark integrity, underexplored question, paper investigates

Comments

Abstract:This paper investigates a fundamental yet underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image manipulation that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective, fine-tuning the editor so that semantically valid edits also preserve the embedded watermark at the final output. This design admits a clean information-theoretic justification: maintaining high bit-accuracy on the edited image lower-bounds the mutual information that the editor channel preserves between watermark and edited output, the quantity that fundamentally controls watermark recoverability. SafeMark is compatible with differentiable diffusion-based editors, and requires no architectural modification. Extensive evaluations across multiple datasets, text-guided editing methods, and post-edit distortion settings demonstrate that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results demonstrate that semantic editability and watermark integrity are fundamentally compatible, enabling trustworthy image provenance in generative editing pipelines.

95. 【2605.19510】Return of Frustratingly Easy Unsupervised Video Domain Adaptation

Link: https://arxiv.org/abs/2605.19510

Authors: Pengfei Wei, Yiqun Sun, Zhiqiang Xu, Yiping Ke, Lawrence B. Hsieh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unsupervised video domain, Unsupervised video, under-explored problem, practical but under-explored, Unsupervised

Comments: To appear in ICML 2026

Abstract:Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.

96. 【2605.19506】EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

Link: https://arxiv.org/abs/2605.19506

Authors: Pengtao Ma, Ziliang Zhou, Ciyu Ruan, Haoyang Wang, Kaiyuan Li, Zihang Gong, Wenhua Ding, Chen Gao, Jingao Xu, Xinlei Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformer-based Video-LLMs makes, Video-LLMs makes dense, precise geometric structure, cost of Transformer-based, Transformer-based Video-LLMs

Comments

Abstract:First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.

97. 【2605.19491】Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning

Link: https://arxiv.org/abs/2605.19491

Authors: Jiusong Ge, Yingkang Zhan, Wenjie Zhao, Di Zhang, Ke Wang, Jiashuai Liu, Chunze Yang, Chengzu Li, Jian Zhang, Yuxin Dong, Ni Zhang, Qidong Liu, Mireia Crispin-Ortuzar, Huazhu Fu, Chen Li, Zeyu Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiple instance learning, Traditional whole slide, extracts patch-level features, instance learning, slide-level prediction

Comments: Accepted to ICML 2026

Abstract:Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at this https URL.

98. 【2605.19490】Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

Link: https://arxiv.org/abs/2605.19490

Authors: Kanglong Quan, Zhebing Xia, Linfeng Jiang, Hao Yu, Ziheng Qiao, Dapeng Dong, Dongyao Jia

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Comprehensive and efficient, efficient validation, validation of connected, connected and automated, real-world deployment

Comments

Abstract:Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.

99. 【2605.19484】CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Link: https://arxiv.org/abs/2605.19484

Authors: Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)

Keywords: made significant progress, basic operating system, remain largely underexplored, operating system tasks, largely underexplored

Comments

Abstract:While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

100. 【2605.19478】Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

Link: https://arxiv.org/abs/2605.19478

Authors: Zeyao Liu, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Yuexin Xuan, Xiaoshuang Ji

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing ViT backdoor, Existing ViT, inflict performance degradation, ViT backdoor attacks, backdoor attacks based

Comments

Abstract:Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic and context-aware architectures can facilitate a far more dangerous and emergent threat. This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are tightly fused into the same sparse, high-magnitude parameter core. This fusion creates a formidable "hostage" dilemma, as pruning the attack necessarily destroys the benign performance. Comprehensive evaluations show VIPER effectively addresses the attacker's trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100% ASR even under 90% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16%) of inference latency. VIPER's results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.

101. 【2605.19446】Targeted Downstream-Agnostic Attack

Link: https://arxiv.org/abs/2605.19446

Authors: Zhuxin Lei, Ziyuan Yang, Yi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: representation extraction, gained widespread, strong capability, capability in representation, DAA methods

Comments

Abstract:Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.

102. 【2605.19436】CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Link: https://arxiv.org/abs/2605.19436

Authors: Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: correct answer favor, correct answer, verifiable rewards, solution under reinforcement, reinforcement learning

Comments: 9 pages

Abstract:When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at this https URL.

103. 【2605.19435】KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

Link: https://arxiv.org/abs/2605.19435

Authors: Maya Yanko, Yoli Shavit

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Place Recognition, Visual Place, Place Recognition, autonomous navigation, critical for autonomous

Comments

Abstract:Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: this https URL

104. 【2605.19410】Vision Harnessing Agent for Open Ad-hoc Segmentation

Link: https://arxiv.org/abs/2605.19410

Authors: Zilin Wang, Stella X. Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: open ad-hoc concepts, open ad-hoc, open ad-hoc segmentation, VASA, requiring retrieval

Comments: 23 pages, 11 figures

Abstract:Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: Programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.

105. 【2605.19398】Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Link: https://arxiv.org/abs/2605.19398

Authors: Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong, Hae-Gon Jeon

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain overly static, overly static, generate videos, videos that remain, remain overly

Comments: Preprint

Abstract:Image-to-video (I2V) models often generate videos that remain overly static compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.

106. 【2605.19393】Neuron Incidence Redistribution for Fairness in Medical Image Classification

Link: https://arxiv.org/abs/2605.19393

Authors: Abin Shoby, Lyle John Palmer, Nikhil Cherian Kurian

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep learning models, subgroup performance disparities, Deep learning, medical image classification, Neuron Incidence Redistribution

Comments: 4 pages, 1 figure

Abstract:Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.
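One possible reading of the penalty described in this abstract (the variance, across penultimate-layer neurons, of predicted-probability-weighted mean activations) is sketched below. This is an assumption-laden illustration and not the authors' code: the normalization by the sum of probabilities and the function name `nir_penalty` are choices made here, and a real implementation would operate on tensors inside a training loop.

```python
def nir_penalty(activations, probs):
    """Variance across neurons of probability-weighted mean activations.

    activations: list of per-sample activation vectors (penultimate layer).
    probs: predicted positive-class probability for each sample.
    The weighting and normalization are assumptions for illustration.
    """
    total_p = sum(probs)
    n_neurons = len(activations[0])
    # Probability-weighted mean activation of each neuron over the batch.
    weighted_means = [
        sum(p * row[j] for p, row in zip(probs, activations)) / total_p
        for j in range(n_neurons)
    ]
    center = sum(weighted_means) / n_neurons
    # Population variance across neurons: large when one channel dominates.
    return sum((m - center) ** 2 for m in weighted_means) / n_neurons


# Evidence concentrated in one channel versus spread across both.
concentrated = nir_penalty([[4.0, 0.0], [4.0, 0.0]], [0.5, 0.5])
spread = nir_penalty([[2.0, 2.0], [2.0, 2.0]], [0.5, 0.5])
print(concentrated, spread)
```

Under this reading, the penalty is zero when activation mass is spread evenly across neurons and grows when a single channel dominates, which is consistent with the abstract's stated goal of distributing latent disease evidence across the full penultimate layer. No demographic labels appear anywhere in the computation.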

107. 【2605.19390】LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

Link: https://arxiv.org/abs/2605.19390

Authors: Chaoyue Li, Yongxue Xu, Jie Feng, Jiayu Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent large multimodal, large multimodal models, Recent large, continuous spatiotemporal dynamic, Time Geometry Encoding

Comments

Abstract:Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray-Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at this https URL.

108. 【2605.19386】MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

Link: https://arxiv.org/abs/2605.19386

Authors: Yang Yang, Yiyan Wang, Zheming Liu, Naoya Iwamoto

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing simulation-ready deformable, Reconstructing simulation-ready, simulation-ready deformable objects, important for vision, simulation-ready deformable

Comments: Submitted to SIGGRAPH Asia 2026

Abstract:Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.

109. 【2605.19378】Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Link: https://arxiv.org/abs/2605.19378

Authors: Haiying Sha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video Diffusion Transformers, Diffusion Transformers, paper systematically diagnoses, video Diffusion, training failure modes

Abstract:This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.
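
Failure mode (5) in this abstract is easy to reproduce in isolation. The snippet below is a minimal sketch that emulates bfloat16 rounding with the standard library; `to_bfloat16` is a hypothetical helper written for this illustration, it handles only ordinary finite values, and it does not represent the paper's remedy, which the abstract does not describe.

```python
import struct

def to_bfloat16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps only the top 16 bits of the float32 bit pattern.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

weight = 1.0
print(to_bfloat16(weight + 1e-3))  # 1.0: the small update is rounded away
print(to_bfloat16(weight + 1e-2))  # 1.0078125: a larger update survives, coarsely
```

bfloat16 has 8 bits of significand precision, so representable values near 1.0 are spaced 2^-7 (about 0.0078) apart. Any update smaller than half that spacing leaves a weight of that magnitude unchanged.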

110. 【2605.19374】Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

Link: https://arxiv.org/abs/2605.19374

Authors: Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: chest X-ray findings, chest X-rays, chest X-ray, Vision-language alignment, X-ray findings

Comments: Early accepted by MICCAI 2026

Abstract:Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at this https URL.
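
One plausible way to suppress flagged noisy negatives in a contrastive loss is to drop them from the softmax denominator. The sketch below illustrates that idea for the image-to-text direction only. The abstract does not specify the form of the Concept-Aware NCE loss, so the masking rule, the function name `masked_info_nce`, and the temperature value are assumptions.

```python
import math

def masked_info_nce(sim, noisy, temperature=0.07):
    """Image-to-text InfoNCE that drops flagged negatives from the denominator.

    sim[i][j]: similarity between image i and text j (the diagonal is the positive).
    noisy[i][j]: True when pair (i, j) was flagged as a noisy negative.
    Masking out flagged pairs is an assumption of this sketch.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [
            s / temperature
            for j, s in enumerate(row)
            if j == i or not noisy[i][j]
        ]
        pos = row[i] / temperature
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - pos)
    return sum(losses) / len(losses)
```

With two patients who share the same finding, the unmasked loss penalizes the model for scoring the other patient's report highly, while the masked version does not.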

111. 【2605.19371】Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Link: https://arxiv.org/abs/2605.19371

Authors: Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Diffusion models, Dissipation Flow Matching, Flow Matching, relying on noise-based, Diffusion

Abstract:Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, preserving better color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process faces an ill-posed challenge. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blurred (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts $x$-prediction to mitigate high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and $x$-prediction. HDFM outperforms most baseline methods on all datasets.

112. 【2605.19360】Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

Link: https://arxiv.org/abs/2605.19360

Authors: Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph); Optics (physics.optics)

Keywords: AI-generated visual media, rapid proliferation, visual media, media has created, created an urgent

Comments: 30 Pages, 8 Figures

Abstract:The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.
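
The three reported figures can be checked against each other with a short calculation. The sketch below assumes a test set with equal numbers of real and fake videos. The abstract does not state this, so `positive_fraction=0.5` is an assumption, and the function name is illustrative.

```python
def accuracy_from_rates(sensitivity, specificity, positive_fraction=0.5):
    """Overall accuracy implied by sensitivity and specificity.

    positive_fraction is the share of deepfake (positive) samples; 0.5 is an
    assumption here, not a figure from the abstract.
    """
    return positive_fraction * sensitivity + (1 - positive_fraction) * specificity

print(round(accuracy_from_rates(99.86, 95.72), 2))  # 97.79
```

Under that assumption the reported 97.79% accuracy is the mean of the 99.86% sensitivity and 95.72% specificity, so missed real videos (false positives) account for most of the residual error.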

113. 【2605.19359】MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Link: https://arxiv.org/abs/2605.19359

Authors: Halil Ibrahim Gulluk, Olivier Gevaert

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep learning methods, demonstrated promising results, Deep learning, predicting BI-RADS scores, methods have demonstrated

Abstract:Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: this https URL

114. 【2605.19355】Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance

Link: https://arxiv.org/abs/2605.19355

Authors: Soojin Choi, Seokhyeon Hong, Chaelin Kim, Junghyun Nam, Junhyuk Jeon, Junyong Noh

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: varying body shapes, challenging problem, self-contact and near-body, varying body, body shapes

Comments: SIGGRAPH 2026 / ACM TOG. Project page available at [this https URL](https://suzyn.github.io/space_page/)

Abstract:Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self-contact and near-body proximity, remains a challenging problem. While recent geometry-aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry-aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer-based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose-dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction-aware retargeting. Conditioned on these anchors, a graph-based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task-aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state-of-the-art approaches in preserving interaction fidelity across diverse character geometries.

115. 【2605.19342】Semantic-Enriched Latent Visual Reasoning

Link: https://arxiv.org/abs/2605.19342

Authors: Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal latent-space reasoning, replace explicit thinking, Multimodal latent-space, compact latent space, latent-space reasoning aims

Abstract:Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

116. 【2605.19340】Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

Link: https://arxiv.org/abs/2605.19340

Authors: Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision foundation models, achieved strong performance, vision tasks, Vision foundation, achieved strong

Comments: 20 pages, 11 figures, 13 tables. Accepted to CVPR 2026

Abstract:Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

117. 【2605.19329】RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Link: https://arxiv.org/abs/2605.19329

Authors: Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Conventional vision-language models, standard RGB images, RGB images degrade, interpret scenes captured, high dynamic range

Comments: 10 pages, 6 figures, 6 tables

Abstract:Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at this https URL.

118. 【2605.19322】DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

Link: https://arxiv.org/abs/2605.19322

Authors: Minyoung Park, Taehun Kong, Sangjun Ahn

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, greatly expanded multimodal

Abstract:Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
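
The temporal allocation idea can be illustrated with a small sketch. It assumes that novelty is measured as one minus the cosine similarity between a frame feature and an EMA memory, and that budgets are proportional to novelty. The abstract confirms neither choice, and the function name and `decay` value are hypothetical.

```python
import math

def allocate_temporal_budget(frame_feats, total_tokens, decay=0.9):
    """Split a token budget across frames by novelty against an EMA memory."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    memory = list(frame_feats[0])
    novelty = [1.0]  # the first frame has nothing to compare against
    for feat in frame_feats[1:]:
        novelty.append(1.0 - cosine(feat, memory))
        # Exponential moving average keeps a long-term summary of past frames.
        memory = [decay * m + (1 - decay) * f for m, f in zip(memory, feat)]
    total = sum(novelty)
    # Flooring may leave a few tokens unassigned; a real system would redistribute them.
    return [int(total_tokens * n / total) for n in novelty]
```

For three frames where the second repeats the first and the third is new, `allocate_temporal_budget([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], 100)` gives the repeated frame no tokens and splits the budget between the other two.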

119. 【2605.19320】TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Link: https://arxiv.org/abs/2605.19320

Authors: Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Keywords: fine-grained glyph-level structure, Faithful text rendering, weakness of large, glyph-level structure, Faithful text

Abstract:Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
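
A hierarchical reward of this kind might aggregate binary defect judgments as a weighted share of clean checks per level. The sketch below is an assumption-laden illustration: the level weights, the averaging rule, and the function name `hierarchical_reward` are not taken from the paper.

```python
def hierarchical_reward(defects, weights=None):
    """Turn per-level binary defect flags into a scalar in [0, 1].

    defects: dict mapping "global", "word", "glyph" to lists of booleans,
    where True means a defect was judged present.
    weights: per-level weights; the defaults are illustrative assumptions.
    """
    weights = weights or {"global": 0.5, "word": 0.3, "glyph": 0.2}
    reward = 0.0
    for level, weight in weights.items():
        flags = defects.get(level, [])
        # A level with no checks counts as fully clean.
        clean = 1.0 - sum(flags) / len(flags) if flags else 1.0
        reward += weight * clean
    return reward
```

A scalar like this can rank samples within a group for GRPO, or pick the higher-scoring image of a pair as the preferred one for DPO.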

120. 【2605.19319】SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Link: https://arxiv.org/abs/2605.19319

Authors: Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image editing, Visual, editing, image, image editing models

Abstract:Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

121. 【2605.19307】MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

Link: https://arxiv.org/abs/2605.19307

Authors: Quanxing Xu, Yuhao Tian, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang, Chia-Wen Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Question Answering, Large Language Models, Multimodal Large Language, Question Answering, Large Language

Abstract:Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.

122. 【2605.19305】Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

Link: https://arxiv.org/abs/2605.19305

Authors: Tianshu Kuai, Arman Maesumi, Daniel Ritchie, Noam Aigerman

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: meaning the trained, paper tackles, triangulation-agnostic manner, triangulations effectively, trained model

Comments: In ACM Transactions on Graphics (SIGGRAPH 2026). Project page: [this https URL](https://matern-fm.github.io/)

Abstract:This paper tackles the task of learning to generate signals over triangle meshes in a triangulation-agnostic manner, meaning the trained model can be applied to different meshes and triangulations effectively. Practically, the paper adapts the flow matching (FM) paradigm to a mesh-based, triangulation-agnostic setting. Theoretically, it proposes a specific noise distribution which is triangulation agnostic, to be used inside the FM model's denoising process. While noise distributions are usually trivial to devise for, e.g., images, devising a triangulation-agnostic distribution proves to be a much more difficult task. We formulate a mathematical definition of triangulation agnosticism of distributions, via their spectrum. We then show that a discretization of a specific Gaussian random field called a Matérn process holds these desired properties, and provides a simple and efficient sampling algorithm. We use it as our noise model, and adapt FM to the triangulation-agnostic setting by using a state-of-the-art approach for learning signals on meshes in the gradient domain -- PoissonNet -- as the denoiser. We conduct experiments on elaborate tasks such as sampling elastic rest states, and generating poses of humanoids. Our method is shown to be capable of producing highly realistic results for meshes of over one million triangles, significantly exceeding the state-of-the-art in quality and diversity.

123. 【2605.19304】MMGS: 10$\times$ Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

Link: https://arxiv.org/abs/2605.19304

Authors: Beizhen Zhao, Sicheng Yu, Ziran Yin, Dongxu Shen, Hao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: significant overhead due, Gaussian Splatting, suffers from significant, significant overhead, overhead due

Comments: 19 pages

Abstract:While 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it suffers from significant overhead due to massive redundant primitives. Existing compression methods typically rely on local sampling or fixed pruning thresholds, which often struggle to balance redundancy reduction with high-fidelity rendering. To address this, we propose a novel framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our approach integrates three components: (1) we introduce a multi-view 3D Gaussian contribution ranking mechanism that filters primitives using geometric consistency instead of local heuristics; (2) we propose a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) we design an OT-based densification operator that maintains the Gaussian's distributional properties for stable optimization. Our approach achieves state-of-the-art rendering quality with only 10% of the primitives and 10$\times$ accelerated training speeds compared to vanilla 3DGS.

124. 【2605.19301】iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

Link: https://arxiv.org/abs/2605.19301

Authors: Xuezhi Cui,Dongbo Zhou,Wang Guo,Zeyuan Wang,Ziyu Li,Gaozhi Zhou,Xian Li,Ling Zhao,Wentao Yang,Chao Tao,Haifeng Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-Language Models require, Models require efficient, continually emerging downstream, Vision-Language Models, emerging downstream tasks

Comments

Click to view abstract

Abstract:Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7\% compared to current SOTA methods, and decreasing the final total parameters by 86.9\% relative to counterparts. The source code is available at this https URL.

125. 【2605.19289】What Makes Synthetic Data Effective in Image Segmentation

Link: https://arxiv.org/abs/2605.19289

Authors: Jinjin Zhang,Xiefan Guo,Yizhou Jin,Nan Zhou,Di Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Driven by rapid, large-scale generative models, rapid advances, advances in large-scale, large-scale generative

Comments: Accepted to ICML 2026

Click to view abstract

Abstract:Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at this https URL.

126. 【2605.19279】FPED: A Functional-Network Prior-Guided Mixture-of-Experts Framework for Interpretable Brain Decoding

Link: https://arxiv.org/abs/2605.19279

Authors: Yudan Ren,Pengcheng Shi,Zihan Ma,Xiaowei He,Xiao Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic Resonance Imaging, functional Magnetic Resonance, Resonance Imaging, Magnetic Resonance, advanced brain-computer interfaces

Comments: 15 pages, 4 figures

Click to view abstract

Abstract:Visual image reconstruction from functional Magnetic Resonance Imaging (fMRI) is a fundamental task in brain decoding, providing a crucial pathway for understanding human perceptual mechanisms and developing advanced brain-computer interfaces (BCIs). However, most current methods simply flatten fMRI signals from localized visual cortices into one-dimensional (1D) vectors, mapping them directly into latent spaces such as that of Contrastive Language-Image Pre-training (CLIP). This paradigm not only disrupts the inherent network topology of the brain, leading to limited neuroscientific interpretability, but also overlooks the synergistic contributions of other distributed functional networks in processing high-level visual semantics. To address these limitations, we propose FPED, a Functional-Network Prior-Guided Mixture of Experts (MoE) framework for interpretable brain decoding. FPED explicitly models different functional brain networks as specialized experts and employs adaptive routing to capture their complementary contributions to visual semantic understanding. Unlike conventional homogeneous decoding paradigms, our framework incorporates neurobiologically grounded priors to enable structured and interpretable network-level representation learning. Experimental results demonstrate that FPED achieves highly competitive semantic reconstruction performance with only 0.68B parameters. The learned routing dynamics reveal biologically meaningful correspondence between functional brain networks and modality-specific semantic processing, providing transparent neuroscientific interpretability. This suggests that brain network-aware expert modeling is a promising direction for bridging neural decoding and biologically inspired artificial intelligence.

127. 【2605.19260】AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Link: https://arxiv.org/abs/2605.19260

Authors: Yuankai Li,Tinghui Zhu,Ha Min Son,Zhe Zhao,Xin Liu,Muhao Chen

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: Large Multimodal Models, Large Multimodal, Multimodal Models, high-resolution GUI screenshots, GUI

Comments

Click to view abstract

Abstract:Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.
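
The quadtree step described in this abstract can be illustrated with a minimal sketch. It assumes a square grid of patch values whose side is a power of two, and it uses the value range inside a block as the homogeneity test. The function name `build_quadtree`, the `threshold` parameter, and the range criterion are illustrative assumptions, not the authors' implementation.

```python
def build_quadtree(img, x, y, size, threshold, min_size=1):
    """Return the leaf blocks (x, y, size) of an adaptive quadtree.

    A block becomes a leaf when it is visually homogeneous (value range
    <= threshold, an assumed criterion) or cannot be split further.
    """
    values = [img[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size <= min_size or max(values) - min(values) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves.extend(
                build_quadtree(img, x + dx, y + dy, half, threshold, min_size)
            )
    return leaves


# Toy 8x8 "screenshot": flat background with one detailed patch in a corner.
grid = [[0.0] * 8 for _ in range(8)]
grid[0][0] = 1.0
leaves = build_quadtree(grid, 0, 0, 8, threshold=0.5)
print(len(leaves), "leaf tokens instead of", 8 * 8)
```

In this toy case the flat regions collapse into large leaves while the corner is kept at full resolution, so each leaf would contribute one merged token at its original spatial position.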

128. 【2605.19256】Distribution Matching Distillation without Fake Score Network

Link: https://arxiv.org/abs/2605.19256

Authors: Youngjoong Kim,Deokyeong Lee,Jaesik Park

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Distribution Matching Distillation, evolving generative distribution, Matching Distillation, Distribution Matching, generative distribution

Comments

Click to view abstract

Abstract:Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K $256 \times 256$ experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.

129. 【2605.19247】Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

Link: https://arxiv.org/abs/2605.19247

Authors: Yuiko Sakuma,Masakazu Yoshimura,Marcel Gröpl,Zitang Sun,Junji Otsuka,Atsushi Irie,Takeshi Ohashi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: assisted NAS methods, restrictive search spaces, NAS methods enable, NAS methods, assisted NAS

Comments: 42 pages

Click to view abstract

Abstract:Current neural architecture search (NAS) methods are often limited by their predefined, restrictive search spaces. While recent large language model (LLM)-assisted NAS methods enable open-ended search spaces, they often suffer from inefficient exploration due to biased or low-quality design ideas. To address these issues, we propose to semi-automatically structure model design knowledge to guide the search process. Our approach first defines a high-level structural template of architectural attributes. An LLM then populates this template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To efficiently explore this vast space, we introduce FairNAD, using a multi-type mutation that enables broad exploration through mutation with fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We demonstrate the effectiveness of FairNAD in discovering high-performing architectures that yield 0.84, 2.17, and 2.35 points improvement on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively, compared to current state-of-the-art methods.

130. 【2605.19242】PhyWorld: Physics-Faithful World Model for Video Generation

Link: https://arxiv.org/abs/2605.19242

Authors: Pu Zhao,Juyi Lin,Timothy Rupprecht,Arash Akbari,Chence Yang,Rahul Chowdhury,Elaheh Motamedi,Arman Akbari,Yumei He,Chen Wang,Geng Yuan,Weiwei Chen,Yanzhi Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: real-world deployment, provide safe, safe and scalable, scalable environments, environments for training

Comments

Click to view abstract

Abstract:World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.
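
For readers unfamiliar with the second-stage objective, the following is a minimal sketch of the standard DPO loss for a single preference pair, assuming the model exposes log-likelihoods for the preferred and rejected samples. How the paper adapts DPO to a flow-matching video model is not stated in the abstract, so the function `dpo_loss`, its arguments, and the value of `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : strength of the implicit KL constraint (assumed value)
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss falls when the model favors the physically plausible sample
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-3.0, -1.0, -2.0, -2.0))
```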

131. 【2605.19230】Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

Link: https://arxiv.org/abs/2605.19230

Authors: Nikhil Cherian Kurian,Victor Caquilpan Parra,Abin Shoby,Luke Whitbread,Lyle J. Palmer

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: linking imaging morphology, medical image classification, Age, disease prevalence, linking imaging

Comments: 10 Pages, 3 Figures

Click to view abstract

Abstract:Age-dependent performance disparities in medical image classification often arise because age acts as a confounder, linking imaging morphology with disease prevalence. In practice, disparities can manifest as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where prevalence is lower, and can worsen under train-test shifts in the age distribution. Conventional mitigation approaches that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates the effects of age-dependent confounding by targeting spurious age-linked trends rather than enforcing invariance. Following a warm-up phase, we characterize sample difficulty and model its age-dependent trends in a label-conditioned manner. We decorrelate age from dominant age-difficulty trends using robust, Huber-weighted affinity weights, attenuating confounding-driven shortcuts while preserving clinically meaningful, nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by minibatch age variance to ensure stable optimization under limited age diversity. Across two radiology datasets, our approach reduces age-dependent true- and false-positive disparities with minimal AUC impact and remains robust to increasing train-test age distribution shifts.

132. 【2605.19223】HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Link: https://arxiv.org/abs/2605.19223

Authors: Mengqi Shi,Haopeng Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, exhibit strong performance

Comments

Click to view abstract

Abstract:While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

133. 【2605.19218】Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Link: https://arxiv.org/abs/2605.19218

Authors: Beomseok Kang,Dongwon Jo,Jiwon Song,Donghwee Son,Jae-Joon Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-Language Models suffer, Models suffer severe, Vision-Language Models, Models suffer, Key channel pruning

Comments

Click to view abstract

Abstract:Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
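
The idea behind rotating Keys before pruning can be shown with a two-channel toy example. The sketch relies on two facts: an orthogonal rotation applied to both queries and keys leaves attention scores unchanged, and a PCA-style rotation concentrates key energy in the leading channel, so one shared mask can drop the rest. The closed-form 2-D angle, the uncentered second moments, and the names `principal_angle` and `rotate` are simplifying assumptions; the paper's online PCA and Triton kernel are not reproduced here.

```python
import math


def principal_angle(keys):
    """Angle of the dominant axis of 2-D keys (uncentered second moments)."""
    sxx = sum(k[0] * k[0] for k in keys)
    syy = sum(k[1] * k[1] for k in keys)
    sxy = sum(k[0] * k[1] for k in keys)
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def rotate(v, theta):
    """Express v in the basis rotated by theta (an orthogonal transform)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] + s * v[1], -s * v[0] + c * v[1])


keys = [(1.0, 1.1), (2.0, 1.9), (-1.0, -1.0), (3.0, 3.1)]
query = (0.5, -0.2)

theta = principal_angle(keys)
q_rot = rotate(query, theta)
keys_rot = [rotate(k, theta) for k in keys]

exact = [query[0] * k[0] + query[1] * k[1] for k in keys]
rotated_full = [q_rot[0] * k[0] + q_rot[1] * k[1] for k in keys_rot]

# Keep only channel 0 for every token (a shared, head-wise style mask).
pruned_rotated = [q_rot[0] * k[0] for k in keys_rot]
pruned_original = [query[0] * k[0] for k in keys]

err_rot = max(abs(a - b) for a, b in zip(exact, pruned_rotated))
err_orig = max(abs(a - b) for a, b in zip(exact, pruned_original))
print("max score error, rotated vs. unrotated pruning:", err_rot, err_orig)
```

Because the toy keys lie close to a single direction, dropping the second channel after rotation perturbs the scores far less than dropping it in the original basis.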

134. 【2605.19214】Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

Link: https://arxiv.org/abs/2605.19214

Authors: Nikhil Cherian Kurian,Victor Caquilpan Parra,Abin Shoby,Luke Whitbread,Lauren Oakden-Rayner,Robert Vandersluis,Jessica Schrouff,Lyle J. Palmer,Mark Jenkinson

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: mask clinically important, clinically important disparities, false positive rates, varies systematically, mask clinically

Comments: 11 Pages, 2 Figures

Click to view abstract

Abstract:Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
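
The operating-point disparity this abstract targets can be measured with a short sketch that computes per-group true and false positive rates at a fixed threshold and reports the largest deviation from the overall rates. This shows only the evaluation quantity, not the paper's margin regularizer. The helper names `rates` and `worst_group_gap` and the toy labels are assumptions for illustration.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    # Rates default to 0.0 when a group has no positives or no negatives.
    return (tp / pos if pos else 0.0, fp / neg if neg else 0.0)


def worst_group_gap(y_true, y_pred, groups):
    """Largest |TPR or FPR deviation| of any subgroup from the overall rates."""
    overall_tpr, overall_fpr = rates(y_true, y_pred)
    worst = 0.0
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        tpr, fpr = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
        worst = max(worst, abs(tpr - overall_tpr), abs(fpr - overall_fpr))
    return worst


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(worst_group_gap(y_true, y_pred, groups))
```

Here group `a` is over-diagnosed (higher TPR and FPR) and group `b` under-diagnosed, which is the pattern the abstract says subgroup AUC can hide. For several demographic attributes, the same function could be called once per attribute and the maximum taken.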

135. 【2605.19213】Smartphone-based Circular Plot Sampling for Forest Inventory

Link: https://arxiv.org/abs/2605.19213

Authors: Su Sun,Jui-Cheng Chiu,Nabin Khanal,Songlin Fei,Yingjie Victor Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Circular sample plots, plots remains challenging, Circular sample, breast height, remains challenging

Comments

Click to view abstract

Abstract:Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large-scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot-sampling-based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and a natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%), respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.
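
One plausible way to turn a segmented trunk and a metric depth value into a diameter is the pinhole-camera relation below, shown together with the two error metrics the abstract reports. The sketch assumes an ideal pinhole camera and ignores the tangent correction for a cylindrical trunk. The function names, the focal length, and the sample numbers are hypothetical and are not taken from the paper.

```python
def trunk_width_m(pixel_width, depth_m, focal_px):
    """Metric width of an object spanning pixel_width pixels at depth_m meters."""
    return pixel_width * depth_m / focal_px


def mean_abs_error(pred, truth):
    """Mean absolute error, in the units of the inputs."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def mean_abs_rel_error(pred, truth):
    """Mean absolute relative error (MARE) as a fraction of the true value."""
    return sum(abs(p - t) / t for p, t in zip(pred, truth)) / len(truth)


# A trunk mask 60 px wide at breast height, 3.0 m away, focal length 1500 px.
dbh_cm = 100.0 * trunk_width_m(60, 3.0, 1500.0)
print(f"estimated DBH: {dbh_cm:.1f} cm")

# Hypothetical predictions against caliper measurements, in centimeters.
print(mean_abs_error([12.0, 20.0], [10.0, 21.0]))
print(mean_abs_rel_error([12.0, 20.0], [10.0, 21.0]))
```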

136. 【2605.19210】D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

Link: https://arxiv.org/abs/2605.19210

Authors: Shengzhe Chen,Hao Yan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental geometric prior, man-made structures, trainable segmentation networks, fundamental geometric, underlies many natural

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network's output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.
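
For reference, the textbook characterizations of quasi-concavity that the zero-, first-, and second-order conditions in this abstract correspond to can be written as follows. These are the standard statements for a function $u$ on a convex domain $\Omega$ (the symbols $\Omega$, $\lambda$, $t$, and $v$ are notation introduced here), not the paper's exact loss formulations.

```latex
% Definition: every super-level set of u is convex.
\{\, x \in \Omega : u(x) \ge t \,\} \ \text{is convex for all } t \in \mathbb{R}

% Zeroth order (equivalent): a midpoint-type inequality.
u\big(\lambda x + (1-\lambda) y\big) \ \ge\ \min\{u(x),\, u(y)\},
\qquad \forall\, x, y \in \Omega,\ \lambda \in [0,1]

% First order (u differentiable, equivalent): supporting-hyperplane form.
u(y) \ge u(x) \ \Longrightarrow\ \nabla u(x)^{\top} (y - x) \ \ge\ 0

% Second order (u twice differentiable): a quadratic form on the tangent plane.
% Necessary as stated; sufficient when the inequality is strict for all v \neq 0.
\nabla u(x)^{\top} v = 0 \ \Longrightarrow\ v^{\top} \nabla^{2} u(x)\, v \ \le\ 0
```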

137. 【2605.19207】Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

Link: https://arxiv.org/abs/2605.19207

Authors: Sumanth Meenan Kanneti,Aryan Shah

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: medical image analysis, low-resource clinical environments, clinical environments remains, environments remains difficult, remains difficult due

Comments

Click to view abstract

Abstract:Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
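
The headline numbers in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
full_mb, quant_mb = 35.34, 5.76   # model sizes reported in the abstract
full_acc, quant_acc = 82.20, 82.37  # validation accuracy in percent

ratio = full_mb / quant_mb            # compression ratio
reduction = 1.0 - quant_mb / full_mb  # fraction of size removed
acc_delta = quant_acc - full_acc      # change in percentage points

print(f"{ratio:.2f}x smaller, {reduction:.1%} size reduction, "
      f"{acc_delta:+.2f} accuracy points")
```

The ratio rounds to the reported 6.14x, and the accuracy difference of 0.17 points is in favor of the quantized model.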

138. 【2605.19155】Efficient coding along the visual hierarchy

Link: https://arxiv.org/abs/2605.19155

Authors: Ananya Passi,Brian S. Robinson,Michael F. Bonner

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual systems learn, systems learn, rely on millions, millions of training, unlike deep learning

Comments: 34 pages, 6 figures

Click to view abstract

Abstract:Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.

139. 【2605.19137】Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Link: https://arxiv.org/abs/2605.19137

Authors: Svetlana Orlova,Niccolò Cavagnero,Gijs Dubbelman

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image foundation model, massive video datasets, foundation models achieve, image foundation, Video foundation models

Comments: Accepted to CVPR 2026 Workshops CV4Smalls

Click to view abstract

Abstract:Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: this https URL .

140. 【2605.19133】Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

Link: https://arxiv.org/abs/2605.19133

Authors: Muskaan Chopra,Lorenz Sparrenberg,Jan H. Terheyden,Rafet Sifa

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pretrain medical image, Self-supervised learning, medical image models, SSL, pretrain medical

Comments: Accepted at IJCAI 2026

Click to view abstract

Abstract:Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.
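
The coverage and selective-accuracy quantities named in this abstract can be illustrated with a minimal sketch. It assumes a single confidence threshold and plain accuracy on the retained cases; the function name `selective_metrics` and the threshold values are hypothetical and are not taken from the paper's code.

```python
import math
from typing import Sequence, Tuple


def selective_metrics(
    confidences: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy) under confidence-based abstention.

    A case is kept when its confidence is at least `threshold`; all other
    cases are deferred (for example, to clinical review).
    """
    if not (len(confidences) == len(predictions) == len(labels)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        raise ValueError("inputs must be non-empty")

    # Keep only the (prediction, label) pairs the model is confident about.
    kept = [
        (pred, label)
        for conf, pred, label in zip(confidences, predictions, labels)
        if conf >= threshold
    ]
    coverage = len(kept) / len(labels)
    if not kept:
        # Nothing was predicted, so accuracy on the kept set is undefined.
        return coverage, math.nan
    selective_accuracy = sum(pred == label for pred, label in kept) / len(kept)
    return coverage, selective_accuracy


if __name__ == "__main__":
    conf = [0.9, 0.6, 0.8, 0.3]
    pred = [1, 0, 2, 1]
    gold = [1, 1, 2, 0]
    print(selective_metrics(conf, pred, gold, threshold=0.7))
```

Raising the threshold trades coverage for accuracy on the retained cases, which is why the two numbers are reported together.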

141. 【2605.19130】EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Link: https://arxiv.org/abs/2605.19130

Authors: Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Children acquire language, Children acquire, limited visuo-linguistic input, acquire language grounding, large multimodal models

Abstract:Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

142. 【2605.19111】FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

Link: https://arxiv.org/abs/2605.19111

Authors: Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: information explicitly stated, capture factual requirements, generated images align, align with information, information explicitly

Comments: Accepted for an oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. 8 pages in total (1 page of references), 5 figures

Abstract:Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
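
The Factual A/B test described above checks whether a metric prefers the factual reference image over the corresponding generated image. A minimal sketch of that preference rate follows. The function name `ab_preference_rate`, the convention that higher scores are better, and the choice to count ties as failures are all assumptions made for illustration, not details from the paper.

```python
from typing import Sequence, Tuple


def ab_preference_rate(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the metric ranks the reference above the generation.

    Each pair is (score_of_reference_image, score_of_generated_image) under the
    metric being validated. Higher is assumed to mean "more factual".
    Ties are counted as failures (an assumption of this sketch).
    """
    if not score_pairs:
        raise ValueError("score_pairs must be non-empty")
    wins = sum(1 for ref, gen in score_pairs if ref > gen)
    return wins / len(score_pairs)


if __name__ == "__main__":
    # Hypothetical metric scores for four prompt-level pairs.
    pairs = [(0.9, 0.4), (0.5, 0.7), (0.8, 0.8), (0.6, 0.1)]
    print(ab_preference_rate(pairs))
```

A metric that tracks factuality well should score close to 1.0 on such a test, while one that ignores factual content would hover around chance.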

143. 【2605.19075】CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

Link: https://arxiv.org/abs/2605.19075

Authors: Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Grounded multi-video question, multi-video question answering, heterogeneous video archives, Grounded multi-video, events requires systems

Comments: Accepted at ACL 2026 Multimodal Augmented Generation via MultimodAl Retrieval Workshop

Abstract:Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at this https URL.
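
The final citation-merging stage, which emits each fact once with all supporting source identifiers, can be sketched as a simple grouping step. This is only an illustration: it matches duplicate claims by case- and whitespace-normalized text, which is an assumption of this sketch, and the function name `merge_citations` and the source identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_citations(claims: Iterable[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (fact, source_id) pairs so each fact appears once with all sources.

    Facts are treated as duplicates when they match after lowercasing and
    collapsing whitespace. Output preserves first-seen order of facts and
    of sources within each fact.
    """
    merged: Dict[str, Tuple[str, List[str]]] = {}
    for fact, source_id in claims:
        key = " ".join(fact.lower().split())
        # Keep the first surface form of the fact as the one to emit.
        _, sources = merged.setdefault(key, (fact, []))
        if source_id not in sources:
            sources.append(source_id)
    return list(merged.values())


if __name__ == "__main__":
    verified = [
        ("The bridge reopened on Monday.", "vid_a"),
        ("the bridge  reopened on monday.", "vid_b"),
        ("Two people were injured.", "vid_a"),
        ("The bridge reopened on Monday.", "vid_a"),
    ]
    for fact, sources in merge_citations(verified):
        print(fact, sources)
```

In a real pipeline the duplicate check would more plausibly rely on a semantic comparison rather than string normalization; the grouping logic stays the same.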

144. 【2605.19074】Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

Link: https://arxiv.org/abs/2605.19074

Authors: Sumit Laha, Ankit Sharma, Hassan Foroosh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: rapid global expansion, grid instability caused, solar photovoltaic, solar irradiance, expansion of solar

Abstract:The rapid global expansion of solar photovoltaic (PV) capacity, which reached a record 597 GW in 2024, highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding precocious convergence of the network in terms of both weight gradients and filter diversity. Leveraging this architecture-independent improvement that integrates sequential sky imagery with historical PV generation data, we evaluate the models' abilities to predict power output across multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while maintaining computational parsimony. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.
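
To make the contrast between single-horizon and multi-horizon training concrete, here is a minimal sketch of the two objectives. The abstract does not state the loss used in the paper, so mean squared error is an assumption, and the function names and toy values are hypothetical.

```python
from typing import Sequence


def single_horizon_mse(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    horizon_index: int,
) -> float:
    """MSE over samples at one chosen future step (point prediction)."""
    errors = [
        (pred[horizon_index] - target[horizon_index]) ** 2
        for pred, target in zip(predictions, targets)
    ]
    return sum(errors) / len(errors)


def multi_horizon_mse(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> float:
    """MSE averaged jointly over all samples and all future steps."""
    errors = [
        (p - t) ** 2
        for pred, target in zip(predictions, targets)
        for p, t in zip(pred, target)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Two samples, three future time steps each (toy PV outputs).
    preds = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
    truth = [[1.0, 2.0, 5.0], [2.0, 4.0, 2.0]]
    print(single_horizon_mse(preds, truth, horizon_index=0))
    print(multi_horizon_mse(preds, truth))
```

The joint objective exposes errors at later steps that a single-step objective never sees, which is the intuition behind optimizing over a sequence of future values.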

145. 【2605.19060】LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Link: https://arxiv.org/abs/2605.19060

Authors: Xinhe Zhang, Yuyang Zhang, Pengfei Jin, Arnau Marin-Llobet, Na Li, Quanzheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: generation remains challenging, fully volumetric models, image generation remains, computationally expensive, remains challenging

Abstract:High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim 135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

146. 【2605.19033】RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Link: https://arxiv.org/abs/2605.19033

Authors: Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Supervised open-loop training, multi-agent interactions common, Supervised open-loop, complex driving scenarios, Open Motion Dataset

Comments: CVPR 2026 Highlight; Project page at [this https URL](https://ehsan-ami.github.io/rlftsim)

Abstract:Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at this https URL.

147. 【2605.19032】Personalized Face Privacy Protection From a Single Image

Link: https://arxiv.org/abs/2605.19032

Authors: Zachary Yahn, Fatih Ilhan, Tiansheng Huang, Selim Tekin, Sihao Hu, Yichang Xu, Margaret Loper, Ling Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces uploaded online, scrape facial images, uploaded online, online sources, causing facial recognition

Abstract:Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user's identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at this https URL

148. 【2605.19027】MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Link: https://arxiv.org/abs/2605.19027

Authors: Xiangxiang Cui, Tianjin Huang, Yifang Wang, Lijie Hu, Lu Yin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse clinical applications, Medical foundation models, tools in healthcare, demonstrating capabilities, emerged as transformative

Comments: MICCAI 2026

Abstract:Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.

149. 【2605.19020】A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

Link: https://arxiv.org/abs/2605.19020

Authors: Rahul Anand, Siddharth Singh, Dileep A D, Mahadeva Prasanna, Raghavendra Ramachandra

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse visual recognition, Vision foundation models, Presentation Attack Detection, visual recognition tasks, demonstrated strong transferability

Abstract:Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.

150. 【2605.19004】EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

Link: https://arxiv.org/abs/2605.19004

Authors: Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Accurately forecasting human, Accurately forecasting, egocentric perspective plays, wearable sensing systems, forecasting human trajectories

Comments: 21 pages, 14 figures. Project page: [this https URL](https://github.com/yehiahmad/EgoTraj)

Abstract:Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at this https URL.

151. 【2605.18984】Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Link: https://arxiv.org/abs/2605.18984

Authors: Yuqi Tang, Yang Shi, Zhuoran Zhang, Qixun Wang, Xuehai Bai, Yue Ding, Ruizhe Chen, Bohan Zeng, Xinlong Chen, Xuanyu Zhu, Bozhou Li, Yuran Wang, Yifan Dai, Chengzhuo Tong, Xinyu Liu, Yiyan Ji, Yujie Wei, Yuhao Dong, Shilin Yan, Fengxiang Wang, Yi-Fan Zhang, Haotian Wang, Yuanxing Zhang, Pengfei Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent video generative, Multimodal Large Language, Large Language Models, Recent video, structural distortions

Abstract:Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

152. 【2605.18974】Harnessing Self-Supervised Features for Art Classification

Link: https://arxiv.org/abs/2605.18974

Authors: Federico Melis, Davide Bilardello, Emanuele Prato, Evelyn Turri, Lorenzo Baraldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: Classifying artworks presents, significant challenge due, Classifying artworks, significant challenge, challenge due

Comments: IRCDL 2026

Abstract:Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.

153. 【2605.18956】MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

Link: https://arxiv.org/abs/2605.18956

Authors: Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent motion-language models, body parts needed, Recent motion-language, animation or interaction, needed for animation

Abstract:Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

154. 【2605.18916】CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

Link: https://arxiv.org/abs/2605.18916

Authors: Gyubin Lee, Junwon Lee, Juhan Nam

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: remaining temporally synchronized, Counterfactual Video Foley, Video Foley Generation, investigate Counterfactual Video, Foley Generation

Comments: Accepted to CVPR 2026 Workshop on Sight and Sound

Abstract:We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video-Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at this https URL

155. 【2605.18903】Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

Link: https://arxiv.org/abs/2605.18903

Authors: Qiuhe Hong, Yuyang Liu, Shuo Yang, Tiantian Peng, Fei Zhu, Yonghong Tian

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: retaining prior knowledge, Large Language Models, Multimodal Large Language, Balance Continual Learning, Continual Learning

Abstract:Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
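
The idea of modulating the per-sample KL regularization by Reasoning Portability can be illustrated with a small sketch. The abstract does not give the modulation rule, so this assumes RP is scaled to [0, 1] and uses a linear interpolation between a loose and a tight coefficient; the names `kl_coefficient` and `regularized_objective` and the values of `beta_low` and `beta_high` are hypothetical.

```python
from typing import Sequence


def kl_coefficient(rp: float, beta_low: float = 0.01, beta_high: float = 0.1) -> float:
    """Map a portability score in [0, 1] to a KL weight.

    High RP gives a weight near beta_high (tight anchor to the old policy);
    low RP gives a weight near beta_low (relaxed anchor, more exploration).
    The linear form is an assumption of this sketch.
    """
    if not 0.0 <= rp <= 1.0:
        raise ValueError("rp must lie in [0, 1]")
    return beta_low + (beta_high - beta_low) * rp


def regularized_objective(
    rewards: Sequence[float],
    kl_terms: Sequence[float],
    rp_scores: Sequence[float],
) -> float:
    """Mean over samples of reward minus the RP-weighted KL penalty."""
    if not (len(rewards) == len(kl_terms) == len(rp_scores)) or not rewards:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        reward - kl_coefficient(rp) * kl
        for reward, kl, rp in zip(rewards, kl_terms, rp_scores)
    )
    return total / len(rewards)


if __name__ == "__main__":
    # Same reward and KL, but different portability: the high-RP sample is penalized more.
    print(regularized_objective([1.0, 1.0], [2.0, 2.0], [1.0, 0.0]))
```

With identical rewards and KL values, the high-RP sample contributes a larger penalty, which keeps the policy close to previously learned reasoning where that reasoning is reusable.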

156. 【2605.18884】Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition

Link: https://arxiv.org/abs/2605.18884

Authors: Zeheng Wang, Bo Zhao, Yijie Zhu, Zhishu Liu, Hui Ma, Ruixin Zhang, Shouhong Ding, Qianyu Xie, Zitong Yu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: human affective states, understand human affective, emotion recognition aims, integrate text, affective states

Abstract:Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.
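
For readers unfamiliar with the Poincaré ball mentioned above, the sketch below computes the standard geodesic distance in the unit ball (curvature -1), $d(u, v) = \operatorname{arcosh}\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$. It illustrates the geometry only; how the paper embeds labels and samples or retrieves with this distance is not specified in the abstract, and the function name `poincare_distance` is hypothetical.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincaré ball."""
    if len(u) != len(v):
        raise ValueError("points must have the same dimension")
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    argument = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    # Guard against rounding slightly below 1, where acosh is undefined.
    return math.acosh(max(1.0, argument))


if __name__ == "__main__":
    origin = [0.0, 0.0]
    print(poincare_distance(origin, [0.5, 0.0]))
    # The same angular separation costs far more distance near the boundary.
    print(poincare_distance([0.1, 0.0], [0.0, 0.1]))
    print(poincare_distance([0.9, 0.0], [0.0, 0.9]))
```

Distances grow rapidly toward the boundary of the ball, which is the property that makes hyperbolic space a natural fit for tree-like hierarchies: coarse categories can sit near the origin and fine-grained ones near the edge.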

157. 【2605.18880】A Multi-Dimensional Clustering Approach for Identifying Inborn Errors of Immunity

Link: https://arxiv.org/abs/2605.18880

Authors: Nishad Kulkarni, Alexandra K. Martinson, Nicholas L. Rider, Michael Keller, Syed Muhammad Anwar

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: require early diagnosis, prevent end organ, end organ damage, errors of immunity, require early

Comments: Accepted at EMBC 2026

Abstract:Rare diseases such as inborn errors of immunity (IEI) require early diagnosis to prevent end organ damage and improve quality of life. Hurdles in accessing and curating large-scale electronic health record (EHR) data limit the routine data-driven analyses needed to stay at the forefront of IEI and other rare disease trends. Both the development of machine learning (ML) algorithms for pattern recognition in IEI and published methodology on how to systematically process and integrate complex medical data are limited. Our proposed pipeline, including data curation and ML clustering algorithms, is designed to recognize novel rare disease patterns and extract IEI-associated features from a national data registry. Our methodology for EHR data formatting and processing presents the pipeline that transforms raw immunologic lab data into vectors. This is further combined with hyperparameter tuning for disease pattern recognition via clustering. This study refines IEI feature awareness, develops data tool kits for rare disease population analysis, and expands on transforming complex medical records into data structures interpretable by unsupervised ML.

158. 【2605.18868】DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

链接https://arxiv.org/abs/2605.18868

作者:Ye Sun,Xin Wang,Jiaming Zhang,Yifeng Gao,Yixu Wang,Yifan Ding,Qixian Zhang,Henghui Ding,Xingjun Ma,Yu-Gang Jiang

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:underpin critical tasks, models underpin critical, remain highly vulnerable, complex reasoning, vision and multimodal

备注: 23 pages, 13 figures

点击查看摘要

Abstract:While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.

159. 【2605.18860】From Llama to Cria: Scaling Down Neural Networks via Neuron-Level Spectral Structural Importance Evaluation

链接https://arxiv.org/abs/2605.18860

作者:Yongyu Wang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:spectral structural importance, structural importance evaluation, spectral structural, paper proposes, neuron-level spectral structural

备注

点击查看摘要

Abstract:This paper proposes a neuron pruning framework based on neuron-level spectral structural importance evaluation. Given a trained neural network, we record the hidden states of each hidden layer during inference and model neurons as graph nodes, with hidden states treated as graph signals. Using ideas from graph signal processing, we infer layer-wise input and output graphs that characterize the structural relationships among neurons before and after each layer transformation. We then evaluate the spectral structural importance of neurons by analyzing the transformation between these graphs based on spectral graph theory. Neurons with high spectral structural importance are regarded as strongly involved in the internal representation transformation and are therefore preserved, while neurons with low importance scores are selected as pruning candidates. The pruning process is conducted iteratively until a predefined effective parameter reduction target is reached. Instead of fine-tuning after every pruning step, the proposed strategy first removes low-importance neurons to obtain a compact architecture and then applies a final recovery fine-tuning stage to restore task performance. By connecting neuron pruning with graph signal processing and spectral structural analysis, the proposed framework offers a principled way to reduce neural network size while maintaining solution quality. Experimental results on CIFAR-10 image classification and SST-2 sentiment classification show that our method can effectively remove low-importance neurons and achieve compact networks with competitive performance after recovery fine-tuning.

160. 【2605.18855】Delta Attention Residuals

链接https://arxiv.org/abs/2605.18855

作者:Cheng Luo,Zefan Cai,Junjie Hu

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Delta Attention Residuals, Attention Residuals, learned softmax attention, Attention Residuals replace, additive residual connections

备注

点击查看摘要

Abstract:Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ${\approx}$0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, i.e., the change introduced by each sublayer ($\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ${\approx}$0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M--7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7--8.2% validation perplexity gains. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at this https URL.
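
The delta definition in the abstract, $\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy hidden states, the fixed `query` vector, and the unscaled dot-product scoring are all assumptions made for illustration, and the abstract does not specify how queries, scaling, or normalization are handled in the actual method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deltas(hidden_states):
    """Compute v_i = h_{i+1} - h_i for consecutive hidden states."""
    return [
        [b - a for a, b in zip(h_prev, h_next)]
        for h_prev, h_next in zip(hidden_states, hidden_states[1:])
    ]


def attend(query, values):
    """Softmax attention over `values`, scored by dot product with `query`.

    The query here is a hypothetical fixed vector; in a real model it
    would be learned.
    """
    scores = [sum(q * x for q, x in zip(query, v)) for v in values]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed


# Toy cumulative hidden states h_0, h_1, h_2 (assumed values).
hidden = [[1.0, 0.0], [1.0, 1.0], [3.0, 1.0]]
vs = deltas(hidden)
weights, mixed = attend([1.0, 0.0], vs)
print(vs, weights, mixed)
```

Attending over `vs` rather than over `hidden` is the only point the sketch is meant to convey; the contrast figures reported in the abstract come from the paper's experiments, not from this toy example.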

161. 【2605.18853】INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

链接https://arxiv.org/abs/2605.18853

作者:Ahmed Šabanović,Paul Joe Maliakel,Ivona Brandić

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:incurs communication delay, limited model capacity, faces a tradeoff, edge-only execution, high-quality predictions

备注: 8 pages, 3 figures

点击查看摘要

Abstract:Edge deployment of Vision-Language Models (VLMs) faces a tradeoff between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.

162. 【2605.18836】Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation

链接https://arxiv.org/abs/2605.18836

作者:Minyoung Oh,Najeong Chae,Jae-Young Sim

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Dataset Distillation, Generalizable Dataset Distillation, Domain Generalizable Dataset, compact synthetic dataset, Dataset

备注: 17 pages

点击查看摘要

Abstract:Dataset Distillation (DD) synthesizes a compact synthetic dataset that preserves the training utility of a full dataset. However, its standard formulation assumes that test data follow the same distribution as training data, an assumption that rarely holds in practice. A straightforward extension, applying post-hoc Domain Generalization (DG) techniques to distilled data, is ill-suited because existing DG methods rely on the natural diversity of real datasets, which compact synthetic sets inherently lack, while also incurring substantial augmentation overhead that conflicts with the efficiency objective of dataset distillation. To address this limitation, we introduce Domain Generalizable Dataset Distillation (DGDD), a new problem setting that explicitly targets out-of-distribution (OOD) generalization of distilled datasets. We study this problem through a widely adopted DD baseline of Distribution Matching (DM). We attribute the OOD vulnerability of DM to the entanglement of class-discriminative and domain-specific information within the compressed synthetic set, and propose Spectral Gradient Surgery (SGS) to disentangle the two. The key insight of SGS is that cross-domain agreement among domain-wise gradients in the spectral domain reveals which gradient components are shared across source domains (and are therefore class-discriminative) and which are domain-specific. Based on this observation, SGS augments the standard DM update with two complementary gradients: one that reinforces cross-domain shared components and another that explicitly promotes diversity within the distilled dataset. Extensive experiments on diverse-scale benchmarks demonstrate that SGS substantially improves OOD generalization while remaining plug-and-play compatible with existing DM methods.

163. 【2605.18777】XFlowMap: Cross-Scale Generalization and Mapping of Massive Origin-Destination Data

链接https://arxiv.org/abs/2605.18777

作者:Diansheng Guo,Hai Jin

类目:Social and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV)

关键词:existing flow-mapping approaches, flow-mapping approaches frequently, approaches frequently rely, predefined aggregation units, multiple spatial scales

备注

点击查看摘要

Abstract:Mapping large origin-destination (OD) datasets remains challenging because flow maps become cluttered, meaningful patterns occur at multiple spatial scales, and existing flow-mapping approaches frequently rely on predefined aggregation units or manual generalization. This paper presents XFlowMap, a framework for the cross-scale generalization and mapping of massive OD data. Specifically, the framework integrates cross-scale flow pattern (cluster) detection, automated flow map generalization, and a new cartographic representation for analyzing and visualizing complex origin-destination flow structures. The approach detects salient flow patterns at their appropriate origin and destination scales, extracts high-level structures, and generates a new flow map representation that supports holistic interpretation of complex origin-destination flow patterns. A scan-statistic-based procedure is developed to evaluate and generalize cross-scale flow clusters. The detected clusters are then visualized using a novel flow symbol that integrates location, direction, strength, and OD scales in a single representation. The framework supports both area-based and point-based OD data, is robust to sparse and noisy datasets, and enables comparative mapping of stratified flow data. Experiments with synthetic data and U.S. migration data demonstrate that the method effectively extracts meaningful cross-scale flow patterns and produces clear, information-rich flow maps for large mobility datasets, supporting both static presentation and interactive exploration.

164. 【2605.20016】FGSVQA: Frequency-Guided Short-form Video Quality Assessment

链接https://arxiv.org/abs/2605.20016

作者:Xinyi Wang,Angeliki Katsenou,Junxiao Shen,David Bull

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:rapid content variation, complex generation pipeline, Short-form video poses, video quality assessment, rapid content

备注: 4 pages, 1 figure

点击查看摘要

Abstract:Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: this https URL.

165. 【2605.19354】Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction

链接https://arxiv.org/abs/2605.19354

作者:Yilmaz Korkmaz,Vishal M. Patel

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:ill-posed inverse problem, inherently ill-posed inverse, incomplete measurements admit, inverse problem, MRI reconstruction

备注

点击查看摘要

Abstract:MRI reconstruction is an inherently ill-posed inverse problem, since incomplete measurements admit many plausible solutions. This ambiguity becomes more severe under high acceleration, where pixel-domain continuous predictors tend to average over feasible reconstructions and suppress high-frequency anatomy. We address this limitation by moving reconstruction to a discrete multi-scale latent space and posing it as autoregressive next-acceleration-scale prediction. Leveraging discrete priors proven effective in visual autoregressive modeling, our method restricts the solution to compact sequences of codebook tokens, enabling sharp reconstructions even from extremely sparse measurements. This discrete autoregressive formulation also aligns naturally with modern large language model post-training techniques. Building on this observation, we introduce on-policy privileged information distillation for visual autoregressive modeling, where a teacher is given training-only privileged context that is unavailable at inference (in our case, fully sampled acquisitions) and supervises a student trained on its own rollouts, leading to consistent reconstruction gains. Through extensive experiments on the fastMRI benchmark, we show that our approach delivers improved reconstruction performance across diverse sampling patterns under extreme undersampling. The project website is available at this https URL.

166. 【2605.18923】From Division to Decision: Leveraging Temporal Cell-Stage Segmentation for Embryo Transferability Prediction

链接https://arxiv.org/abs/2605.18923

作者:Yasmine Hachani(MALT),Patrick Bouthemy(MALT),Elisa Fromont(MALT),Véronique Duranthon(BREED, ENVA),Ludivine Laffont(BREED),Alline de Paula Reis(BREED, ENVA)

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

关键词:current practice relies, single expert assessment, Accurate selection, challenging task, resulting in high

备注

点击查看摘要

Abstract:Accurate selection of bovine embryos is a challenging task, as current practice relies on a single expert assessment on the seventh day after insemination, resulting in high rates of pregnancy loss. Time-lapse videomicroscopy provides detailed information on early development, but is difficult to exploit because of complex motion patterns and time-consuming analysis. We propose TransFACT, a transformer-based framework for modeling early developmental stages and embryo transferability using 2D time-lapse videos from the first four days of development. TransFACT combines frame-level temporal features with stage-level representations, using developmental stages as auxiliary supervision to predict transferability on day four. Our experiments demonstrate that TransFACT, by leveraging an existing method designed for action recognition, outperforms its competitor in predicting embryo transferability.

167. 【2605.18878】Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

链接https://arxiv.org/abs/2605.18878

作者:Jana Armouti,Laura Hutchins,Jacob Duplantis,Thomas Deiss,Thales Nogueira Gomes,Keyur H. Patel,Seema Walvekar,Shane Guillory,Thomas H. Fox,Amita Krishnan,Ricardo Rodriguez,Bennett DeBoisblanc,Deva Ramanan,John Galeotti,Gautam Gare

类目:Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:congestive heart failure, avoidable healthcare expenditure, Hospital readmission, days of discharge, driver of morbidity

备注

点击查看摘要

Abstract:Hospital readmission within 30 days of discharge is a leading driver of morbidity, mortality, and avoidable healthcare expenditure in congestive heart failure (CHF). Current clinical risk stratification tools rely primarily on non-imaging data and exhibit limited predictive performance. Point-of-care lung ultrasound (LUS) offers a sensitive, noninvasive window into the pulmonary congestion that characterizes CHF decompensation, yet its prognostic utility for readmission prediction remains largely unexplored. We present a pilot feasibility study, the first systematic machine learning study using B-mode LUS acquired during hospitalization to predict 30-day CHF readmission. Quantitative spatiotemporal embeddings are extracted from a pretrained Temporal Shift Module (TSM) ResNet-18 encoder, and interpretable biomarker features are separately evaluated. Through structured ablations over lung view, temporal representation, multi-view fusion, and cross-lung augmentation, we identify the key imaging factors driving readmission risk. Our findings reveal that (1) dependent lower-lung regions (Left-3, Right-3) carry the strongest prognostic signal, consistent with their greater susceptibility to hydrostatic congestion; (2) temporal difference features between sequential examinations substantially outperform single-timepoint representations, highlighting the importance of capturing disease trajectory; and (3) multi-view feature concatenation yields the best overall performance, with our top MLP model achieving an F1 score of 0.80 (95% CI: 0.62-0.96). Biomarker analysis further reveals that pleural-line abnormalities, including breaks and indentations, are as informative as the canonical A-line and B-line markers. These results support POCUS-derived biomarkers as practical, interpretable tools for noninvasive CHF risk stratification.

168. 【2605.18791】SpecX: A Large-Scale Benchmark for Multi-Modal Spectroscopy and Cross-Paradigm Evaluation

链接https://arxiv.org/abs/2605.18791

作者:Chengrui Xiang,Tengfei Ma,Yujie Chen,Tong Wang,Haowen Chen,Xiangxiang Zeng

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)

关键词:Existing spectral benchmarks, multimodal language models, modality alignment, Existing spectral, limited in scale

备注: 9 pages, 1 figure

点击查看摘要

Abstract:Existing spectral benchmarks are limited in scale, modality alignment, and evaluation scope, and typically focus on either specialized models or multimodal language models (MLLMs). We introduce SpecX, a large-scale benchmark for multi-modal spectroscopy with cross-paradigm evaluation. SpecX contains 1.7M molecules with diverse spectral modalities, including NMR (1H, 13C, HSQC), IR, MS, UV, Raman, and FL, and is organized into three tiers: a large-scale dataset for pretraining, an aligned multi-spectral subset for benchmarking, and a high-quality experimental subset for evaluation. SpecX supports a range of tasks such as molecular elucidation, spectrum simulation, and spectral understanding, and enables unified evaluation across both specialized spectral models and MLLMs. Experiments show that specialized models excel at signal-level modeling, while MLLMs exhibit strengths in high-level reasoning but lack precise spectral grounding. SpecX establishes a unified benchmark for spectral intelligence and highlights the need for spectrum-native foundation models.