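This post presents the daily list of the latest papers fetched from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.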

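Statistics

641 papers were updated today, including:

Natural Language Processing: 103 papers
Information Retrieval: 13 papers
Computer Vision: 141 papers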

Natural Language Processing

1. [2604.13035] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Link: https://arxiv.org/abs/2604.13035

Authors: Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, increasingly generate indoor, Language Models, making judgments sensitive

Comments: Project Page: [this https URL](https://lab-spell.github.io/SceneCritic/)

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
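
The rule-based critic mentioned in the abstract above uses collision constraints as feedback. A minimal sketch of such a check on a floor-plan layout might look like the following. It is an illustration under assumptions, not SceneCritic's implementation: the `Box` type, the `find_collisions` helper, and the sample layout are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned footprint of an object on the floor plan (hypothetical type)."""
    name: str
    x: float  # minimum x coordinate
    y: float  # minimum y coordinate
    w: float  # extent along x
    d: float  # extent along y


def overlaps(a: Box, b: Box) -> bool:
    """True when the two footprints share a region of positive area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.d and b.y < a.y + a.d)


def find_collisions(layout: list[Box]) -> list[tuple[str, str]]:
    """Return the name pair of every colliding pair of objects."""
    return [(a.name, b.name) for a, b in combinations(layout, 2) if overlaps(a, b)]


# Made-up layout: the nightstand intrudes into the bed footprint.
layout = [
    Box("bed", 0.0, 0.0, 2.0, 1.6),
    Box("nightstand", 1.8, 0.2, 0.5, 0.5),
    Box("wardrobe", 3.0, 0.0, 1.0, 0.6),
]
collisions = find_collisions(layout)
print(collisions)
```

A list of violating pairs like this could be returned to the generating model as textual feedback for the next refinement round.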

2. [2604.13018] Toward Autonomous Long-Horizon Engineering for ML Research

Link: https://arxiv.org/abs/2604.13018

Authors: Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia

Categories: Computation and Language (cs.CL)

Keywords: sustain coherent progress, engineering remains difficult, research engineering remains, autonomous long-horizon engineering, environment setup

Comments: Repo: [this https URL](https://github.com/AweAI-Team/AiScientist)

Abstract:Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces the PaperBench score by 6.41 points and the MLE-Bench Lite score by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.

3. [2604.13016] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Link: https://arxiv.org/abs/2604.13016

Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, remain poorly understood, dynamics remain poorly, On-policy distillation, training dynamics remain

Comments: 30 pages, 23 figures. Code: [this https URL](https://github.com/thunlp/OPD)

Abstract:On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.

4. [2604.13006] One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Link: https://arxiv.org/abs/2604.13006

Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Instruction-tuned large language, models produce helpful, large language models, language models produce, produce helpful

Abstract:Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77--100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59--96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.

5. [2604.12995] PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Link: https://arxiv.org/abs/2604.12995

Authors: Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru, Rui Su, Yujun Zhou, Xiangqi Wang, Kehan Guo, Nitesh V Chawla, Yanfang Ye, Xiangliang Zhang

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large Language Models, Large Language, Language Models, increasingly integrated

Comments: Accepted by ACL 2026 findings

Abstract:Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

6. [2604.12989] Accelerating Speculative Decoding with Block Diffusion Draft Trees

Link: https://arxiv.org/abs/2604.12989

Authors: Liran Ringel, Yaniv Romano

Categories: Computation and Language (cs.CL)

Keywords: multiple future tokens, propose multiple future, Speculative decoding accelerates, accelerates autoregressive language, Speculative decoding

Abstract:Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
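
The best-first heap selection and the ancestor-only attention mask described in the abstract above can be sketched roughly as follows. This is an assumption-laden illustration, not DDTree's implementation: it scores each prefix by the product of per-position draft probabilities (one possible surrogate), and the function names and toy distributions are hypothetical.

```python
import heapq
import math


def build_draft_tree(position_probs, node_budget):
    """Pick up to `node_budget` prefixes in order of decreasing probability.

    position_probs[i] maps token -> draft probability at block position i.
    A prefix is scored by the sum of its per-position log-probabilities,
    an assumed surrogate for the chance that the target model accepts it.
    """
    heap = [(-math.log(p), (tok,)) for tok, p in position_probs[0].items() if p > 0]
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < node_budget:
        neg_logp, prefix = heapq.heappop(heap)
        tree.append(prefix)
        depth = len(prefix)
        if depth < len(position_probs):
            # Children never outscore their parent, so a parent is always
            # selected before any of its descendants.
            for tok, p in position_probs[depth].items():
                if p > 0:
                    heapq.heappush(heap, (neg_logp - math.log(p), prefix + (tok,)))
    return tree


def ancestor_mask(tree):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[node[:len(other)] == other for other in tree] for node in tree]


# Toy per-position distributions for a block of three tokens (made-up values).
position_probs = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.6, "dog": 0.4},
    {"sat": 0.9, "ran": 0.1},
]
tree = build_draft_tree(position_probs, node_budget=5)
mask = ancestor_mask(tree)
print(tree)
```

Because every selected node has its parent in the tree, the mask lets one forward pass of the target model score all branches without tokens on different branches attending to each other.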

7. [2604.12978] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Link: https://arxiv.org/abs/2604.12978

Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optical character recognition, Optical character, cluster of high, advanced rapidly, evaluation has remained

Abstract:Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: this https URL, Benchmark: this https URL.

8. [2604.12928] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Link: https://arxiv.org/abs/2604.12928

Authors: Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: recently emerged, emerged to enhance, enhance the naturalness, naturalness of conversational, language models

Abstract:Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
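
The idea of hiding retrieval latency behind the response onset, as described in the abstract above, can be illustrated with a small `asyncio` sketch. It is only a toy under assumptions: the `retrieve` and `respond` coroutines, the filler phrases, and the sleep durations are hypothetical stand-ins, not MoshiRAG's pipeline.

```python
import asyncio


async def retrieve(query: str) -> str:
    """Stand-in for an external knowledge source (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated retrieval latency
    return f"[retrieved fact for: {query}]"


async def respond(query: str) -> list[str]:
    """Start retrieval, speak the response onset, then deliver the core information."""
    spoken = []
    task = asyncio.create_task(retrieve(query))  # retrieval runs in the background
    for filler in ("Good question.", "Let me think."):
        spoken.append(filler)
        await asyncio.sleep(0.03)  # simulated speaking time; retrieval proceeds meanwhile
    spoken.append(await task)  # waits only if retrieval is still running
    return spoken


transcript = asyncio.run(respond("height of Mont Blanc"))
print(transcript)
```

The conversation never pauses for the lookup as long as the onset takes at least as long as the retrieval.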

9. [2604.12919] MetFuse: Figurative Fusion between Metonymy and Metaphor

Link: https://arxiv.org/abs/2604.12919

Authors: Saptarshi Ghosh, Tianyu Jiang

Categories: Computation and Language (cs.CL)

Keywords: largely in isolation, co-occur in natural, computational work, work has studied, studied them largely

Comments: ACL 2026

Abstract:Metonymy and metaphor often co-occur in natural language, yet computational work has studied them largely in isolation. We introduce a framework that transforms a literal sentence into three figurative variants: metonymic, metaphoric, and hybrid. Using this framework, we construct MetFuse, the first dedicated dataset of figurative fusion between metonymy and metaphor, containing 1,000 human-verified meaning-aligned quadruplets totaling 4,000 sentences. Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding the largest gains on metonymy tasks. Using this dataset, we also analyze how the presence of one figurative type impacts another. Our findings show that both human annotators and large language models better identify metonymy in hybrid sentences than in metonymy-only sentences, demonstrating that the presence of a metaphor makes a metonymic noun more explicit. Our dataset is publicly available at: this https URL.

10. [2604.12911] Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Link: https://arxiv.org/abs/2604.12911

Authors: Ronald Skorobogat, Ameya Prabhu, Matthias Bethge

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: guide the development, Multilingual, frontier models, Multilingual benchmarks guide, frontier

Abstract:Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similarly to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly ($\rho = 0.94$) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
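
The round-trip procedure described in the abstract above (translate to a target language and back, then compare with the original) can be sketched as follows. This is a toy under assumptions and not the LiT benchmark: token-overlap F1 stands in for whatever semantic comparison the authors use, and the word-level "translators" are hypothetical placeholders for the model under test.

```python
from collections import Counter
from typing import Callable


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1, a crude stand-in for a semantic similarity measure."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def round_trip_score(text: str,
                     forward: Callable[[str], str],
                     backward: Callable[[str], str]) -> float:
    """Translate source -> target -> source and compare with the original."""
    return token_f1(text, backward(forward(text)))


# Toy word-level dictionaries (hypothetical); the backward one has a lossy entry.
EN_DE = {"the": "die", "cat": "katze", "sleeps": "schläft", "soundly": "fest"}
DE_EN = {"die": "the", "katze": "cat", "schläft": "sleeps", "fest": "firmly"}


def en_to_de(s: str) -> str:
    return " ".join(EN_DE.get(w, w) for w in s.lower().split())


def de_to_en(s: str) -> str:
    return " ".join(DE_EN.get(w, w) for w in s.lower().split())


score = round_trip_score("The cat sleeps soundly", en_to_de, de_to_en)
print(round(score, 2))
```

No reference translation in the target language is needed: only the source text and its round-trip reconstruction are compared.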

11. [2604.12843] Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Link: https://arxiv.org/abs/2604.12843

Authors: Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky

Categories: Computation and Language (cs.CL)

Keywords: Item Response Theory, rapid release, makes it increasingly, increasingly costly, costly to evaluate

Abstract:The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than $400$ models, our framework predicts full-evaluation performance within 2-3 percentage points using only $100$ anchor questions per dataset, with Spearman $\rho \geq 0.9$ for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at this https URL
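
To make the anchor-item idea in the abstract above concrete, here is a minimal sketch of ability estimation with item parameters held fixed. It is a simplification under assumptions: it uses a one-dimensional two-parameter logistic (2PL) model instead of the multidimensional IRT the abstract mentions, a plain grid search for the maximum-likelihood estimate, and made-up item parameters.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, anchors, grid_step=0.01):
    """Maximum-likelihood ability on a grid over [-4, 4]; anchor parameters stay fixed."""
    def log_lik(theta):
        total = 0.0
        for correct, (a, b) in zip(responses, anchors):
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total
    grid = [-4.0 + i * grid_step for i in range(round(8.0 / grid_step) + 1)]
    return max(grid, key=log_lik)


def predict_accuracy(theta, items):
    """Expected accuracy on items whose parameters are already calibrated."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)


# Hypothetical calibrated (a, b) pairs; the values are made up for illustration.
anchors = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]
full_set = anchors + [(1.1, -1.0), (0.9, 1.0), (1.3, 2.0)]

# A new model is run only on the anchors (1 = correct, 0 = incorrect).
strong = estimate_ability([1, 1, 1, 1, 0], anchors)
weak = estimate_ability([1, 0, 0, 0, 0], anchors)
print(strong, predict_accuracy(strong, full_set))
print(weak, predict_accuracy(weak, full_set))
```

Because the anchor parameters never change, abilities estimated at different times land on the same scale, which is what makes the resulting scores comparable.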

12. [2604.12820] RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

Link: https://arxiv.org/abs/2604.12820

Authors: Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large-scale web corpora, Large language models, inherently absorb harmful, Large language, inherently absorb

Abstract:Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
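
To get a feel for the two complexity classes quoted in the abstract above, the leading-order terms can be evaluated for sample sizes. The values of `d` and `r` below are hypothetical, constant factors are ignored, and the resulting ratio is not the same quantity as the roughly 3x wall-clock speedup over training-based baselines that the abstract reports.

```python
def full_cost(d: int) -> int:
    """Leading-order operation count for the O(d^3) update."""
    return d ** 3


def low_rank_cost(d: int, r: int) -> int:
    """Leading-order operation count for the O(r^3 + r^2 * d) variant."""
    return r ** 3 + r ** 2 * d


d, r = 4096, 64  # hypothetical hidden size and rank
ratio = full_cost(d) / low_rank_cost(d, r)
print(f"{full_cost(d):,} vs {low_rank_cost(d, r):,} (ratio about {ratio:.0f})")
```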

13. [2604.12816] The role of System 1 and System 2 semantic memory structure in human and LLM biases

Link: https://arxiv.org/abs/2604.12816

Authors: Katherine Abramski, Giulio Rossetti, Massimo Stella

Categories: Computation and Language (cs.CL)

Keywords: significant societal risks, pose significant societal, large language models, System, pose significant

Comments: 31 pages, 5 figures, 9 appendix figures

Abstract:Implicit biases in both humans and large language models (LLMs) pose significant societal risks. Dual process theories propose that biases arise primarily from associative System 1 thinking, while deliberative System 2 thinking mitigates bias, but the cognitive mechanisms that give rise to this phenomenon remain poorly understood. To better understand what underlies this duality in humans, and possibly in LLMs, we model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable datasets generated by both humans and LLMs. We then investigate how these distinct semantic memory structures relate to implicit gender bias using network-based evaluation metrics. We find that semantic memory structures are irreducible only in humans, suggesting that LLMs lack certain types of human-like conceptual knowledge. Moreover, semantic memory structure relates consistently to implicit bias only in humans, with lower levels of bias in System 2 structures. These findings suggest that certain types of conceptual knowledge contribute to bias regulation in humans, but not in LLMs, highlighting fundamental differences between human and machine cognition.

14. [2604.12776] EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution

Link: https://arxiv.org/abs/2604.12776

Authors: Shiyu He, Minchi Kuang, Mengxian Wang, Bin Hu, Tingxiang Gu

Categories: Computation and Language (cs.CL)

Keywords: LLM-based multi-agent systems, Realizing endogenous narrative, Realizing endogenous, Interactive Agent Societies, Endogenous Interactive Agent

Comments: Accepted to the Main Conference of ACL 2026

Abstract:Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, the Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.

15. [2604.12770] Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

Link: https://arxiv.org/abs/2604.12770

Authors: Timon Ziegenbein, Maja Stahl, Henning Wachsmuth

Categories: Computation and Language (cs.CL)

Keywords: large language models, Editing human-written text, language models, human-written text, standard use case

Abstract:Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one's arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.
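
The training signal described in the abstract above combines several reward components and then applies group relative policy optimization. A rough sketch of those two steps follows. The equal weights, the weighted-sum combination rule, and the sample scores are assumptions for illustration; the abstract names the components but not how they are combined.

```python
from statistics import mean, pstdev

# Hypothetical equal weights for the four components named in the abstract.
WEIGHTS = {"similarity": 0.25, "fluency": 0.25, "conformity": 0.25, "appropriateness": 0.25}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of the per-component scores (assumed combination rule)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one sampled group, as GRPO does."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate edits sampled for the same argument (made-up scores).
group = [
    {"similarity": 0.9, "fluency": 0.8, "conformity": 0.7, "appropriateness": 0.6},
    {"similarity": 0.5, "fluency": 0.9, "conformity": 0.4, "appropriateness": 0.8},
    {"similarity": 0.3, "fluency": 0.6, "conformity": 0.2, "appropriateness": 0.3},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Edits that score above the group average receive a positive advantage and are reinforced, while those below it are pushed down, so no separate value model is needed.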

16. [2604.12766] NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.12766

Authors: Jihao Dai (1 and 2), Dingjun Wu (1), Yuxuan Chen (1), Zheni Zeng (2), Yukun Yan (1), Zhenghao Liu (3), Maosong Sun (1) ((1) Tsinghua University, (2) Nanjing University, (3) Northeastern University)

Categories: Computation and Language (cs.CL)

Keywords: maps queries directly, Retrieval-augmented generation, isolated text segments, flat retrieval paradigm, typically relies

Abstract:Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging these reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm performance gains stem from our method's capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss the efficiency, applicable scenarios, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

17. [2604.12748] Generating Effective CoT Traces for Mitigating Causal Hallucination

Link: https://arxiv.org/abs/2604.12748

Authors: Yiheng Zhao, Jun Yan

Categories: Computation and Language (cs.CL)

Keywords: complex reasoning tasks, event causality identification, causal hallucination, large language models, excel in complex

Comments: 11 pages, 2 figures. Accepted at ACL 2026

Abstract:Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently a lack of CoT trace datasets available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.

18. [2604.12744] Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

Link: https://arxiv.org/abs/2604.12744

Authors: Terra Blevins, Stephen Mayhew, Marek Šuppa, Hila Gonen, Shachar Mirkin, Vasile Pais, Kaja Dobrovoljc, Voula Giouli, Jun Kevin, Eugene Jang, Eungseo Kim, Jeongyeon Seo, Xenophon Gialis, Yuval Pinter

Categories: Computation and Language (cs.CL)

Keywords: assumptions remain scarce, language models promise, multilingual language models, Named Entity Recognition, gold-standard evaluation benchmarks

Comments: LREC 2026

Abstract:While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.

19. [2604.12736] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

Link: https://arxiv.org/abs/2604.12736

Authors: Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv

Categories: Computation and Language (cs.CL)

Keywords: Relative Policy Optimization, Group Relative Policy, Group Relative, large language models, Policy Optimization

Abstract:Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.

20. 【2604.12721】InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

Link: https://arxiv.org/abs/2604.12721

Authors: Shreya Gupta, Prottay Kumar Adhikary, Bhavyaa Dave, Salam Michael Singh, Aniket Deroy, Tanmoy Chakraborty

Categories: Computation and Language (cs.CL)

Keywords: organizes patient symptoms, formulation organizes patient, organizes patient, patient symptoms, symptoms and psychosocial

Comments:

Click to view abstract

Abstract:Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time-consuming and varies across clinicians. We present InsightFlow, an LLM-based approach that automatically generates 5P-aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM-generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert-rated clinical criteria. The generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain-like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

21. 【2604.12710】LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Link: https://arxiv.org/abs/2604.12710

Authors: Junxiao Yang, Haoran Liu, Jinzhe Tu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Jiaqi Weng, Jialing Tao, Hui Xue, Hongning Wang, Han Qiu, Minlie Huang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: exhibit severe vulnerabilities, Large language models, Large language, demonstrate strong safety, strong safety performance

Comments:

Click to view abstract

Abstract:Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment requires anchoring safety understanding not in surface text, but in the model's language-agnostic semantic space.

22. 【2604.12666】From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation

Link: https://arxiv.org/abs/2604.12666

Authors: Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Text-based web agents, agents offer computational, agents remains challenging, offer computational efficiency, remains challenging due

Comments: 17 pages, 10 figures

Click to view abstract

Abstract:Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.

23. 【2604.12659】Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting

Link: https://arxiv.org/abs/2604.12659

Authors: Kaiqi Hu, Linda Xiao, Shiyue Xu, Ziyi Tang, Mingwen Liu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Vision-language models, benchmarks inadequately evaluate, increasingly applied, inadequately evaluate, evaluate their understanding

Comments: We evaluate whether VLMs can comprehend multi-scale visual stock price data like human analysts with a proposed benchmark, identifying current VLMs' weak predictive power, significant biases, and limited sensitivity to forecast horizons and prompts

Click to view abstract

Abstract:Vision-language models (VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock prices in candlestick charts. First, prior studies fail to isolate whether VLMs' comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs, whereas human analysts rely heavily on multi-scale candlestick charts, in which longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points; this mismatch makes it difficult to systematically assess VLMs' ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick chart dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.

24. 【2604.12651】Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs

Link: https://arxiv.org/abs/2604.12651

Authors: Alkid Baci, Luke Friedrichs, Caglar Demir, N'Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Knowledge graph embedding, Knowledge graph, graph embedding, heterogeneous graphs, struggle with unseen

Comments:

Click to view abstract

Abstract:Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists this http URL$, $\geq 5 \; this http URL$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: this https URL .

25. 【2604.12647】Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

Link: https://arxiv.org/abs/2604.12647

Authors: Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: non-invasive disease screening, costly expert annotation, scarce labeled data, Automated respiratory audio, analysis promises scalable

Comments: Accepted at AHLI CHIL 2026

Click to view abstract

Abstract:Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
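The tiered routing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cascade_predict`, the per-tier confidence thresholds, and the toy tiers are all hypothetical, and the abstract does not say how TRIAGE actually computes confidence. The sketch only shows the general early-exit pattern, in which cheap tiers answer when they are confident and later tiers handle the rest.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical tier interface: a tier maps a sample to (label, confidence in [0, 1]).
Tier = Callable[[str], Tuple[str, float]]


def cascade_predict(
    sample: str,
    tiers: Sequence[Tier],
    thresholds: Sequence[float],
) -> Tuple[str, int]:
    """Run tiers from cheapest to most expensive.

    Returns (label, index of the tier that produced it). A non-final tier
    answers only if its confidence reaches its threshold; the final tier
    always answers.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    if len(thresholds) != len(tiers) - 1:
        raise ValueError("need exactly one threshold per non-final tier")
    for index, tier in enumerate(tiers[:-1]):
        label, confidence = tier(sample)
        if confidence >= thresholds[index]:
            return label, index  # early exit at a cheap tier
    label, _ = tiers[-1](sample)
    return label, len(tiers) - 1


# Toy stand-ins for the three stages (assumptions, not real models).
def tier_l(sample: str) -> Tuple[str, float]:
    return ("wheeze", 0.9 if "easy" in sample else 0.4)


def tier_m(sample: str) -> Tuple[str, float]:
    return ("crackle", 0.8 if "medium" in sample else 0.3)


def tier_h(sample: str) -> Tuple[str, float]:
    return ("normal", 1.0)


print(cascade_predict("easy case", [tier_l, tier_m, tier_h], [0.7, 0.7]))
```

Raising a threshold sends more samples to the later, more expensive tiers, so in a design like this the thresholds are the knob that trades compute for accuracy.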

26. 【2604.12634】RPRA: Predicting an LLM-Judge for Efficient but Performant Inference

Link: https://arxiv.org/abs/2604.12634

Authors: Dylan R. Ashley, Gaël Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai, Ernie Chang, Mingchen Zhuge, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Large language models, computationally limited devices, Large language, number of parameters, face a fundamental

Comments: 10 pages in main text + 6 pages of references + 36 pages of appendices, 12 figures in main text + 37 figures in appendices, 2 tables in main text + 3 tables in appendices, 13 prompts in appendices

Click to view abstract

Abstract:Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and having models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict -- prior to responding -- how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.

27. 【2604.12633】Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

Link: https://arxiv.org/abs/2604.12633

Authors: Vadim Borisov

Categories: Computation and Language (cs.CL)

Keywords: settings remains constrained, multilingual settings remains, predominantly English, Emotion classification, existing corpora

Comments:

Click to view abstract

Abstract:Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) while surpassing on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at this https URL
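For readers unfamiliar with the F1-micro figure quoted above, the following is a small sketch of how micro-averaged F1 is commonly computed for multi-label predictions. It is a generic illustration, not the paper's evaluation code; the function name `micro_f1` and the toy labels are made up for this example. Micro-averaging pools true positives, false positives, and false negatives over all samples and labels before computing a single F1 score.

```python
from typing import List, Set


def micro_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions given as label sets."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Pool counts across every sample and every label.
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy example with invented labels: 2 true positives, 1 false positive, 1 false negative.
gold = [{"joy", "surprise"}, {"anger"}]
pred = [{"joy"}, {"anger", "fear"}]
print(round(micro_f1(gold, pred), 3))
```

Because counts are pooled, frequent labels dominate the micro average; a macro average would instead weight each emotion category equally.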

28. 【2604.12630】GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

Link: https://arxiv.org/abs/2604.12630

Authors: Zhaochen Liu, Limeng Qiao, Guanglu Wan, Tingting Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal large language, Multimodal large, large language models, exhibited remarkable performance, large language

Comments:

Click to view abstract

Abstract:Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.

29. 【2604.12610】Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

Link: https://arxiv.org/abs/2604.12610

Authors: Xudong Wang, Chaoning Zhang, Qigan Sun, Zhenzhen Huang, Chang Lu, Sheng Zheng, Zeyu Ma, Caiyan Qin, Yang Yang, Hengtao Shen

Categories: Computation and Language (cs.CL)

Keywords: mitigates hallucination, hallucination in large, large language models, RAG, incorporating external knowledge

Comments: 12 pages, 5 figures

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.

30. 【2604.12559】FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing

Link: https://arxiv.org/abs/2604.12559

Authors: Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu

Categories: Computation and Language (cs.CL)

Keywords: Unstructured model editing, Unstructured model, memorize text holistically, model editing aims, existing methods

Comments: ACL 2026 findings

Click to view abstract

Abstract:Unstructured model editing aims to update models with real-world text, yet existing methods often memorize text holistically without reliable fine-grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, reflecting the unidirectional Transformer flow in which surface-form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Our code is publicly available at this https URL.

31. 【2604.12540】When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

Link: https://arxiv.org/abs/2604.12540

Authors: Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: scarcity limits NLP, limits NLP development, West African languages, limits NLP, NLP development

Comments: 13 pages, 6 tables; previously submitted to KDD 2026

Click to view abstract

Abstract:Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe -- hurting NER while helping POS -- suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

32. 【2604.12518】Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

Link: https://arxiv.org/abs/2604.12518

Authors: Kang He, Yuzhe Ding, Xinrong Wang, Fei Li, Chong Teng, Donghong Ji

Categories: Computation and Language (cs.CL)

Keywords: Multimodal sentiment analysis, integrates heterogeneous text, infer human emotions, Multimodal sentiment, sentiment analysis

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.

33. 【2604.12506】Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Link: https://arxiv.org/abs/2604.12506

Authors: Linhao Zhang, Yuhan Song, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Recent Audio Large, Audio Large Language, Large Language Models, Large Language, striking performance inversion

Comments: Accepted to ACL 2026 Findings

Click to view abstract

Abstract:Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components -- Transcription, Paralinguistics, and Non-linguistic Events -- within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at this https URL.

34. 【2604.12503】Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting

Link: https://arxiv.org/abs/2604.12503

Authors: Shuai Wang, Xixi Wang, Yinan Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, shown remarkable capabilities, Base Question Answering

Comments: 12 pages, 2 figures

Click to view abstract

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we propose a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling the LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: this https URL.

35. 【2604.12493】Latent Planning Emerges with Scale

Link: https://arxiv.org/abs/2604.12493

Authors: Michael Hanna, Emmanuel Ameisen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: perform seemingly planning-intensive, writing coherent stories, seemingly planning-intensive tasks, functioning code, perform seemingly

Comments: ICLR 2026

Click to view abstract

Abstract:LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like "accountant", and cause them to output "an" rather than "a"; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.

36. 【2604.12491】Calibrated Confidence Estimation for Tabular Question Answering

Link: https://arxiv.org/abs/2604.12491

Authors: Lukas Voss

Categories: Computation and Language (cs.CL)

Keywords: Large language models, tabular question answering, Large language

Comments: 27 pages, 9 figures, 17 tables (8-page main body + appendix)

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p < 0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
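The idea behind Multi-Format Agreement, as described in this abstract, can be sketched roughly as follows: serialize the same table in several formats, ask the model the same question once per format, and treat the share of formats that agree with the majority answer as a confidence score. The code below is only an illustration under stated assumptions. The `ask_model` callable, the simple serializers (which do no escaping), the answer normalization, and the majority-vote scoring rule are all hypothetical, and the paper's actual scoring may differ.

```python
import csv
import io
import json
from collections import Counter
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def to_markdown(rows: List[Row]) -> str:
    cols = list(rows[0])
    lines = ["| " + " | ".join(cols) + " |",
             "| " + " | ".join("---" for _ in cols) + " |"]
    lines += ["| " + " | ".join(r[c] for c in cols) + " |" for r in rows]
    return "\n".join(lines)


def to_html(rows: List[Row]) -> str:
    # Toy serializer: cell values are not HTML-escaped.
    cols = list(rows[0])
    head = "".join(f"<th>{c}</th>" for c in cols)
    body = "".join(
        "<tr>" + "".join(f"<td>{r[c]}</td>" for c in cols) + "</tr>" for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_json(rows: List[Row]) -> str:
    return json.dumps(rows)


def to_csv(rows: List[Row]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def format_agreement(
    rows: List[Row],
    question: str,
    ask_model: Callable[[str, str], str],  # hypothetical: (table_text, question) -> answer
) -> Tuple[str, float]:
    """Return (majority answer, fraction of formats that gave that answer)."""
    serializers = (to_markdown, to_html, to_json, to_csv)
    answers = [ask_model(s(rows), question).strip().lower() for s in serializers]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)
```

With four formats the score in this sketch can only take the values 0.25, 0.5, 0.75, or 1.0. Each format needs just one model call, which may be part of why the abstract reports lower API cost than sampling baselines.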

37. 【2604.12487】KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

Link: https://arxiv.org/abs/2604.12487

Authors: Shuai Wang,Yinan Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, natural language understanding, Large Language, Language Models, exhibit strong abilities

Comments: 15 pages, 4 figures

Click to view abstract

Abstract:Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Codes are available at the repository: this https URL.

38. 【2604.12479】Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning

Link: https://arxiv.org/abs/2604.12479

Authors: Shanyong Wang,Shuhang Lin,Yining Zhao,Xi Zhu,Yongfeng Zhang

Categories: Computation and Language (cs.CL)

Keywords: Recent advances, large language models, advances in large, large language, significantly improved

Comments: 20 pages, 13 figures

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have significantly improved the alignment of models with general human preferences. However, a major challenge remains in adapting LLMs to individual preferences, which are not only diverse but also dynamic. In this paper, we introduce a novel framework, Preference-Paired Fine-Tuning (PFT), designed to align models with contradictory and evolving individual preferences. We present a new dataset, Value Conflict Dilemma (VCD), which includes scenarios that involve conflicting human preferences, facilitating the evaluation of our approach. Our experiments demonstrate that PFT outperforms single-preference training methods, achieving up to 96.6% accuracy in multi-choice classification tasks and the highest open-ended generation score of 8.69. PFT also shows significant improvements over DPO, SFT and some traditional training methods, especially when handling conflicting preferences. Additionally, with limited user history data, models can infer preference vectors rapidly, achieving a 44.76% improvement in user-specific preference alignment in comparison to single-preference models.

39. 【2604.12477】Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe

Link: https://arxiv.org/abs/2604.12477

Authors: Mahounan Pericles Adjovi,Roald Eiselen,Prasenjit Mitra

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, linguistic knowledge encoded, models remains accessible, low-resource language communities, Large language

Comments: 11 pages, 5 figures, 6 tables; to appear in LREC-COLING 2026

Click to view abstract

Abstract:Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.

40. 【2604.12471】Beyond Single-Dimension Novelty: How Combinations of Theory, Method, and Results-based Novelty Shape Scientific Impact

Link: https://arxiv.org/abs/2604.12471

Authors: Yi Zhao,Yang Chenggang,Yuzhuo Wang,Tong Bao,Zhang Heng,Chengzhi Zhang

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: novelty, scientific impact, Scientific novelty drives, Scientific, research frontier

Comments: AII-EEKE 2026

Click to view abstract

Abstract:Scientific novelty drives advances at the research frontier, yet it is also associated with heightened uncertainty and potential resistance from incumbent paradigms, leading to complex patterns of scientific impact. Prior studies have primarily examined the relationship between a single dimension of novelty -- such as theoretical, methodological, or results-based novelty -- and scientific impact. However, because scientific novelty is inherently multidimensional, focusing on isolated dimensions may obscure how different types of novelty jointly shape impact. Consequently, we know little about how combinations of novelty types influence scientific impact. To this end, we draw on a dataset of 15,322 articles published in Nature Communications. Using the DeepSeek-V3 model, we classify articles into three novelty dimensions based on the content of their Introduction sections: theoretical novelty, methodological novelty, and results-based novelty. These dimensions may coexist within the same article, forming distinct novelty configurations. Scientific impact is measured using five-year citation counts and indicators of whether an article belongs to the top 1% or top 10% highly cited papers. Descriptive results indicate that results-based novelty alone and the simultaneous presence of all three novelty types are the dominant configurations in the sample. Regression results further show that articles with results-based novelty only receive significantly more citations and are more likely to rank among the top 1% and top 10% highly cited papers than articles exhibiting all three novelty types. These findings advance our understanding of how multidimensional novelty configurations shape knowledge diffusion.

41. 【2604.12452】Latent-Condensed Transformer for Efficient Long Context Modeling

Link: https://arxiv.org/abs/2604.12452

Authors: Zeng You,Yaofo Chen,Qiuwu Chen,Ying Sun,Shuhai Zhang,Yingjian Li,Yaowei Wang,Mingkui Tan

Categories: Computation and Language (cs.CL)

Keywords: Large language models, face significant challenges, Large language, processing long contexts, long contexts due

Comments: Accepted by ACL 2026

Click to view abstract

Abstract:Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5$\times$ prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.

42. 【2604.12442】GLeMM: A large-scale multilingual dataset for morphological research

Link: https://arxiv.org/abs/2604.12442

Authors: Hathout Nabil(CLLE, Comue de Toulouse),Basilio Calderone(CLLE, UBM),Fiammetta Namer(ATILF, UL),Franck Sajous(CLLE-ERSS, Comue de Toulouse)

Categories: Computation and Language (cs.CL)

Keywords: relations between words, mechanisms govern, govern the variation, variation in form-meaning, form-meaning relations

Comments:

Click to view abstract

Abstract:In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? The answers to this type of question are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word-formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.

43. 【2604.12426】Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

Link: https://arxiv.org/abs/2604.12426

Authors: Alicia Curth,Rachel Lawrence,Sushrut Karmalkar,Niranjani Prasad

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: investigate whether transformers, increasing difficulty, depth adaptively, adaptive depth

Comments: Accepted at the ICLR 2026 Workshop on Logical Reasoning of Large Language Models

Click to view abstract

Abstract:We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.

44. 【2604.12424】Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

Link: https://arxiv.org/abs/2604.12424

Authors: Sihang Jia,Shuliang Liu,Songbo Yang,Yibo Yan,Xin Zou,Xuming Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Models frequently suffer, Language Models frequently, Multimodal Large

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

45. 【2604.12421】Agentic Insight Generation in VSM Simulations

Link: https://arxiv.org/abs/2604.12421

Authors: Micha Selak,Dirk Krechel,Adrian Ulges,Sven Spieckermann,Niklas Stoehr,Andreas Loehr

Categories: Computation and Language (cs.CL)

Keywords: Extracting actionable insights, stream map simulations, Extracting actionable, actionable insights, insights from complex

Comments:

Click to view abstract

Abstract:Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework's viability, with top-tier models achieving accuracies of up to 86% and showing high robustness across evaluation runs.

46. 【2604.12397】KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

Link: https://arxiv.org/abs/2604.12397

Authors: Yudong Li,Jiawei Cai,Linlin Shen

Categories: Computation and Language (cs.CL)

Keywords: Standard Large Language, Large Language Model, Standard Large, flattened token sequences, Large Language

Comments: Accepted by ACL 2026 Main Conference

Click to view abstract

Abstract:Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness to learn the documents within the real-world knowledge structure. Experimental results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.

47. 【2604.12385】From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

Link: https://arxiv.org/abs/2604.12385

Authors: Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Hang Zeng,Shaojie Tang,Fan Wu,Guihai Chen

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, predominant form, large language, Multi-turn dialogue

Comments:

Click to view abstract

Abstract:Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.

48. 【2604.12378】ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

Link: https://arxiv.org/abs/2604.12378

Authors: Daniil Gurgurov,Tom Röhr,Sebastian von Rohrscheidt,Josef van Genabith,Alexander Löser,Simon Ostermann

Categories: Computation and Language (cs.CL)

Keywords: remain English-centric, multilingual capabilities, advances in multilingual, reasoning traces, English-centric

Comments: Under review

Click to view abstract

Abstract:Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.

49. 【2604.12377】SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

Link: https://arxiv.org/abs/2604.12377

Authors: SungHo Kim,Juhyeong Park,Eda Atalay,SangKeun Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: featural writing system, morphologically rich language, morphologically rich, featural writing, writing system

Comments: Accepted at ACL 2026 Findings

Click to view abstract

Abstract:Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enriches subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at this https URL.

50. 【2604.12376】Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

Link: https://arxiv.org/abs/2604.12376

Authors: Ziyang Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM conversations grow, retrieve full content, model recover, context window, conversations grow

Comments: 16 pages, 10 figures, 16 tables

Click to view abstract

Abstract:When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
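The paging mechanism described in this abstract can be illustrated with a small Python sketch. This is an illustration under stated assumptions, not the paper's code: the class name `PagedConversation`, the FIFO eviction choice, and the keyword heuristic (most frequent words of four or more letters) are all hypothetical. Only the bookmark shape `[pN:keywords]` and the `recall()` tool come from the abstract.

```python
import re
from collections import Counter


class PagedConversation:
    """Toy conversation store: evicted pages leave keyword bookmarks behind."""

    def __init__(self, max_live_pages=2, keywords_per_page=3):
        self.max_live_pages = max_live_pages
        self.keywords_per_page = keywords_per_page
        self.live = []       # (page_id, text) pairs still inside the context
        self.evicted = {}    # page_id -> full text, retrievable via recall()
        self.bookmarks = []  # bookmark strings standing in for evicted pages
        self._next_id = 1

    def _keywords(self, text):
        # Hypothetical heuristic: most frequent words with 4+ letters.
        words = re.findall(r"[a-z]{4,}", text.lower())
        top = Counter(words).most_common(self.keywords_per_page)
        return [word for word, _ in top]

    def add_page(self, text):
        self.live.append((self._next_id, text))
        self._next_id += 1
        while len(self.live) > self.max_live_pages:
            # FIFO eviction (an assumption; the paper compares several policies).
            page_id, old_text = self.live.pop(0)
            self.evicted[page_id] = old_text
            keywords = ",".join(self._keywords(old_text))
            self.bookmarks.append(f"[p{page_id}:{keywords}]")

    def context(self):
        """Bookmarks for evicted pages, followed by the live pages."""
        return self.bookmarks + [text for _, text in self.live]

    def recall(self, page_id):
        """Return the full text of an evicted page, or None if unknown."""
        return self.evicted.get(page_id)
```

A model would see the output of `context()` in its prompt and call `recall(N)` when a bookmark looks relevant. The abstract's finding that bookmark discrimination is the bottleneck corresponds to how distinctive the `_keywords` output is across pages.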

51. 【2604.12374】Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Link: https://arxiv.org/abs/2604.12374

Authors: NVIDIA:Aakshita Chandiramani,Aaron Blakeman,Abdullahi Olaoye,Abhibha Gupta,Abhilash Somasamudramath,Abhinav Khattar,Adeola Adesoba,Adi Renduchintala,Adil Asif,Aditya Agrawal,Aditya Vavre,Ahmad Kiswani,Aishwarya Padmakumar,Ajay Hotchandani,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Gronskiy,Alex Kondratenko,Alex Neefus,Alex Steiner,Alex Yang,Alexander Bukharin,Alexander Young,Ali Hatamizadeh,Ali Taghibakhshi,Alina Galiautdinova,Alisa Liu,Alok Kumar,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Anahita Bhiwandiwalla,Ananth Subramaniam,Andrew Tao,Anjaney Shrivastava,Anjulie Agrusa,Ankur Srivastava,Ankur Verma,Ann Guan,Anna Shors,Annamalai Chockalingam,Anubhav Mandarwal,Aparnaa Ramani,Arham Mehta,Arti Jain,Arun Venkatesan,Asha Anoosheh,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asli Sabanci Demiroz,Asma Kuriparambil Thekkumpate,Atefeh Sohrabizadeh,Avinash Kaur,Ayush Dattagupta,Barath Subramaniam Anandan,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Benjamin Chislett,Besmira Nushi,Bilal Kartal,Bill Thiede,Bita Darvish Rouhani,Bobby Chen,Boris Ginsburg,Brandon Norick,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Buvaneswari Mani,Carlo del Mundo,Chankyu Lee,Chanran Kim,Chantal Hwang,Chao Ni,Charles Wang,Charlie Truong,Cheng-Ping Hsieh,Chenhan Yu,Chenjie Luo,Cherie Wang,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Chris Holguin,Chris Wing,Christian Munley,Christopher Parisien,Chuck Desai,Chunyang Sheng,Collin Neale,Cyril Meurillon,Dakshi Kumar

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: parameter hybrid Mamba-Attention, describe the pre-training, hybrid Mamba-Attention, Nemotron, billion

Comments:

Click to view abstract

Abstract:We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

52. 【2604.12373】Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Link: https://arxiv.org/abs/2604.12373

Authors: Tomer Ashuach,Liat Ein-Dor,Shai Gretz,Yoav Katz,Yonatan Belinkov

Categories: Computation and Language (cs.CL)

Keywords: Humans use introspection, internal states inaccessible, private internal states, understanding through private, private internal

Comments: Accepted to ACL 2026 (Main Conference). 8 pages, 16 figures, 2 tables

Click to view abstract

Abstract:Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

53. 【2604.12359】Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Link: https://arxiv.org/abs/2604.12359

Authors: Rui Yin,Tianxu Han,Naen Xu,Changjiang Li,Ping He,Chunyi Zhou,Jun Wang,Zhihui Fu,Tianyu Du,Jinbao Li,Shouling Ji

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: distribute backdoored checkpoints, large language models, Safety-aligned large language, real-world pipelines, adversaries can distribute

Comments: ACL 2026 Main Conference

Click to view abstract

Abstract:Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., "Sure"), which does not guarantee sustained harmful output -- the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.

54. 【2604.12352】MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents

Link: https://arxiv.org/abs/2604.12352

Authors: Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, Heuiseok Lim

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: processing long industrial, long industrial documents, long industrial document, powerful method, method for processing

Abstract:RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
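
Step (iv) of the abstract above, building chunks by a depth-first walk over the section tree, can be sketched roughly as follows. This is only an illustration and not the authors' implementation: the `Section` class, the `dfs_chunks` and `flatten` functions, the `max_chars` limit, and the rule "keep a subtree whole if it fits, otherwise split at its children" are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the document hierarchy (hypothetical structure)."""
    title: str
    text: str = ""
    children: list[Section] = field(default_factory=list)


def flatten(node: Section) -> str:
    """Concatenate a node's own text with the text of all descendants."""
    parts = [node.text] + [flatten(child) for child in node.children]
    return " ".join(p for p in parts if p)


def dfs_chunks(node: Section, max_chars: int = 200,
               path: tuple[str, ...] = ()) -> list[str]:
    """Depth-first grouping: emit a whole subtree as one chunk when it fits,
    otherwise emit the node's own text and recurse into its children.
    A leaf that is too long is still emitted as a single oversized chunk."""
    here = path + (node.title,)
    header = " > ".join(here)  # heading path kept as retrieval context
    whole = flatten(node)
    if len(whole) <= max_chars or not node.children:
        return [f"[{header}] {whole}"]
    chunks = [f"[{header}] {node.text}"] if node.text else []
    for child in node.children:
        chunks.extend(dfs_chunks(child, max_chars, here))
    return chunks


doc = Section("Manual", "Intro.", [
    Section("Safety", "Wear gloves.", [Section("Electrical", "Unplug first.")]),
    Section("Maintenance", "Oil monthly."),
])
chunks = dfs_chunks(doc, max_chars=30)
```

Prefixing each chunk with its heading path is one simple way to keep the hierarchy visible to a retriever; the paper may encode that context differently.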

55. 【2604.12321】ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Link: https://arxiv.org/abs/2604.12321

Authors: Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou

Categories: Computation and Language (cs.CL)

Keywords: Existing Chinese toxic, Existing Chinese, Chinese toxic content, target sentence-level classification, content detection methods

Notes: Accepted to ACL 2026 Findings

Abstract:Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at this https URL.

56. 【2604.12312】CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

Link: https://arxiv.org/abs/2604.12312

Authors: Jingbo Yang, Guanyu Yao, Bairu Hou, Xinghan Yang, Nikolai Glushnev, Iwona Bialynicka-Birula, Duo Ding, Shiyu Chang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, domain-specific operational guidelines, enterprise environments, ensuring their strict

Abstract:As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.

57. 【2604.12308】ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance

Link: https://arxiv.org/abs/2604.12308

Authors: Haoran Li, Yulin Chen, Huihao Jing, Wenbin Hu, Tsz Ho Li, Chanhou Lou, Hong Ting Tsang, Sirui Han, Yangqiu Song

Categories: Computation and Language (cs.CL)

Keywords: Individuals' concerns, sensitive patterns, extend beyond sensitive, Data Protection Regulation, highly contextualized

Notes: Accepted by ACL 26

Abstract:Individuals' concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs' compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.

58. 【2604.12290】Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Link: https://arxiv.org/abs/2604.12290

Authors: Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, Qinhuai Na

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Current LLM agent, Current LLM, search-based question answering, LLM agent benchmarks, binary pass

Abstract:Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization (an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget), spanning 47 tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency (~1/iteration) and magnitude (~1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
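
The propose-execute-evaluate loop described in the abstract above can be illustrated with a toy sketch. Everything here is assumed for the example: the `optimize` function, the one-dimensional design variable, the hill-climbing `propose` stub standing in for an LLM agent, and the `verify` stub standing in for a simulator that returns a feasibility flag and a continuous score. It is not the benchmark's actual harness.

```python
import random
from typing import Callable, Optional


def optimize(propose: Callable[[Optional[float], Optional[float]], float],
             verify: Callable[[float], tuple[bool, float]],
             budget: int) -> tuple[Optional[float], Optional[float], list[int]]:
    """Run a fixed number of propose -> verify rounds and keep the best
    feasible candidate. Also record the steps at which the best improved."""
    best, best_score, improved_at = None, None, []
    for step in range(1, budget + 1):
        candidate = propose(best, best_score)   # agent revises using feedback
        feasible, score = verify(candidate)     # executable verifier feedback
        if not feasible:
            continue                            # hard constraint: discard
        if best_score is None or score > best_score:
            best, best_score = candidate, score
            improved_at.append(step)
    return best, best_score, improved_at


rng = random.Random(0)  # seeded so the toy run is repeatable


def propose(best: Optional[float], best_score: Optional[float]) -> float:
    """Hypothetical agent: random start, then small perturbations of the best."""
    if best is None:
        return rng.uniform(0.0, 5.0)
    return best + rng.gauss(0.0, 0.5)


def verify(x: float) -> tuple[bool, float]:
    """Hypothetical verifier: design must lie in [0, 5]; the optimum is x = 3."""
    return 0.0 <= x <= 5.0, -(x - 3.0) ** 2


best, best_score, improved_at = optimize(propose, verify, budget=50)
```

The `improved_at` list is the kind of trace one could inspect to see how improvement frequency changes over iterations, which is the quantity the abstract's power-law observation refers to.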

59. 【2604.12289】The Enforcement and Feasibility of Hate Speech Moderation on Twitter

Link: https://arxiv.org/abs/2604.12289

Authors: Manuel Tonneau, Dylan Thurgood, Diyi Liu, Niyati Malhotra, Victor Orozco-Olvera, Ralph Schroeder, Scott A. Hale, Manoel Horta Ribeiro, Paul Röttger, Samuel P. Fraiberger

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Keywords: substantial social harms, consistently platforms enforce, hate speech, platforms enforce hate, enforce hate speech

Abstract:Online hate speech is associated with substantial social harms, yet it remains unclear how consistently platforms enforce hate speech policies or whether enforcement is feasible at scale. We address these questions through a global audit of hate speech moderation on Twitter (now X). Using a complete 24-hour snapshot of public tweets, we construct representative samples comprising 540,000 tweets annotated for hate speech by trained annotators across eight major languages. Five months after posting, 80% of hateful tweets remain online, including explicitly violent hate speech. Such tweets are no more likely to be removed than non-hateful tweets, with neither severity nor visibility increasing the likelihood of removal. We then examine whether these enforcement gaps reflect technical limits of large-scale moderation systems. While fully automated detection systems cannot reliably identify hate speech without generating large numbers of false positives, they effectively prioritize likely violations for human review. Simulations of a human-AI moderation pipeline indicate that substantially reducing user exposure to hate speech is economically feasible at a cost below existing regulatory penalties. These results suggest that the persistence of online hate cannot be explained by technical constraints alone but also reflects institutional choices in the allocation of moderation resources.

60. 【2604.12282】Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

Link: https://arxiv.org/abs/2604.12282

Authors: Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Hongsheng Li

Categories: Computation and Language (cs.CL)

Keywords: scientific data management, enterprise reporting, data management, scientific data, spreadsheet

Notes: Accepted to ACL 2026 (main conference)

Abstract:Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at this https URL.

61. 【2604.12268】CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Link: https://arxiv.org/abs/2604.12268

Authors: Zaoyu Chen, Jianbo Dai, Boyu Zhu, Jingdong Wang, Huiming Wang, Xin Xu, Haoyang Yuan, Zhijiang Guo, Xiao-Ming Wu

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Large language models, Large language, behavior remains unclear, remains unclear, natural language

Abstract:Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We introduce CodeSpecBench, a benchmark for executable behavioral specification generation under an execution-based evaluation protocol. CodeSpecBench supports both function-level and repository-level tasks and encodes specifications as executable Python functions. Constructed from diverse real-world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15 state-of-the-art LLMs on CodeSpecBench, we observe a sharp performance degradation on repository-level tasks, where the best model attains only a 20.2% pass rate. We further find that specification generation is substantially more challenging than code generation, indicating that strong coding performance does not necessarily reflect deep understanding of intended program semantics. Our data and code are available at this https URL.
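
To make "executable behavioral specification" concrete, here is a small hand-written example of a precondition and postcondition encoded as Python functions, for a hypothetical sorting routine. The function names and the `accepts` helper are illustrative assumptions, not the benchmark's actual interface.

```python
from collections import Counter


def precondition(xs: list) -> bool:
    """Inputs the function under test is expected to handle: lists of ints."""
    return isinstance(xs, list) and all(isinstance(x, int) for x in xs)


def postcondition(xs: list, result: list) -> bool:
    """The result must be in non-decreasing order and must contain exactly
    the same elements as the input (same multiset)."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(xs) == Counter(result)


def accepts(xs: list, result: list) -> bool:
    """A behavior (input, output) is accepted if both conditions hold."""
    return precondition(xs) and postcondition(xs, result)


valid = accepts([3, 1, 2], [1, 2, 3])       # correct behavior
dropped = accepts([3, 1, 2], [1, 2])        # an element was lost
unsorted = accepts([3, 1, 2], [3, 1, 2])    # order is wrong
```

In the abstract's terms, accepting the valid behavior corresponds to correctness, and rejecting the two faulty behaviors corresponds to completeness. A weaker postcondition that only checked ordering would still be correct but would wrongly accept the `dropped` case.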

62. 【2604.12262】CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

Link: https://arxiv.org/abs/2604.12262

Authors: Raeyoung Chang, Dongwook Kwon, Jisoo Lee, Nikhil Verma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Cascaded LLM systems, Cascaded LLM, LLM systems coordinate, abstention under uncertainty, varying sizes

Notes: 12 pages, 6 figures, 4 tables, 1 algorithm

Abstract:Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier's escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.
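
The routing idea in the abstract above (answer alone when confident, deliberate within the tier when unsure, escalate only if deliberation fails) can be sketched as below. This is a simplified illustration under stated assumptions: models are hypothetical callables returning an answer and a confidence, deliberation is replaced by a plain majority vote, thresholds are fixed, and the paper's online threshold optimizer is not modeled.

```python
from collections import Counter
from typing import Callable

# A model maps a query to (answer, confidence in [0, 1]); all stubs are hypothetical.
Model = Callable[[str], tuple[str, float]]


def cascade(query: str,
            tiers: list[tuple[Model, list[Model]]],
            human: Callable[[str], str],
            conf_threshold: float = 0.8,
            agree_threshold: float = 0.66) -> tuple[str, str]:
    """Return (answer, route taken). Each tier is (single model, ensemble)."""
    for level, (single, ensemble) in enumerate(tiers):
        answer, conf = single(query)
        if conf >= conf_threshold:
            return answer, f"tier{level}:single"      # confident: stop here
        if ensemble:
            # Uncertain case: let the tier's ensemble vote before escalating.
            votes = Counter(agent(query)[0] for agent in ensemble)
            top, count = votes.most_common(1)[0]
            if count / len(ensemble) >= agree_threshold:
                return top, f"tier{level}:debate"     # consensus resolves it
        # No consensus: fall through to the next, costlier tier.
    return human(query), "human"                      # final fallback


def unsure(q: str) -> tuple[str, float]:
    return "A", 0.5


def says(answer: str, conf: float = 0.6) -> Model:
    return lambda q: (answer, conf)


agreeing = [says("B"), says("B"), says("B")]
split = [says("A"), says("B"), says("C")]

resolved = cascade("q", [(unsure, agreeing)], human=lambda q: "H")
escalated = cascade("q", [(unsure, split), (says("C", 0.9), [])], human=lambda q: "H")
fallback = cascade("q", [(unsure, split)], human=lambda q: "H")
```

The ensemble is only invoked on the uncertain branch, which is what keeps the extra compute proportional to query difficulty in this sketch.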

63. 【2604.12258】Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System

Link: https://arxiv.org/abs/2604.12258

Authors: Taehun Kim, Hyeryun Park, Hyeonhoon Lee, Yushin Lee, Kyungsang Kim, Hyung-Chul Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: requiring domain expertise, involves labor-intensive processes, research involves labor-intensive, sensitive patient data, programming skills

Notes: 10 pages, 5 figures, 2 tables, Supplementary Appendix

Abstract:Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.

64. 【2604.12250】How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm

Link: https://arxiv.org/abs/2604.12250

Authors: Taisei Hishiki, Takaya Arita, Reiji Suzuki

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)

Keywords: Large Language Model, Large Language, Language Model, Social Particle Swarm, characteristics of Large

Notes: 12 pages, 6 figures and 2 tables

Abstract:This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner's Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents' reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.

65. 【2604.12247】SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Link: https://arxiv.org/abs/2604.12247

Authors: Zhuofan Wen, Yang Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, accelerate autoregressive inference, Speculative decoding, promising approach, approach to accelerate

Notes: ACL 2026 Findings

Abstract:Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decision and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.

66. 【2604.12243】Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature

Link: https://arxiv.org/abs/2604.12243

Authors: Jinkai Tao, Yubo Wang, Xiaoyu Liu, Menglin Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generation requires tracking, Continuous Knowledge Metabolism, requires tracking, hypothesis generation requires, Scientific hypothesis generation

Notes: 32 pages, 6 figures

Abstract:Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen's d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field's trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types. These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.

67. 【2604.12237】MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization

Link: https://arxiv.org/abs/2604.12237

Authors: Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: preserving structural similarity, molecular optimization aims, improve molecular properties, drug discovery, original molecule

Abstract:In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5× over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at this https URL.

68. 【2604.12231】Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems

Link: https://arxiv.org/abs/2604.12231

Authors: Tao Feng, Pengrui Han, Guanyu Lin, Ge Liu, Jiaxuan You

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, powerful internal capabilities, Large language, language models, transformed AI research

Abstract:Large language models (LLMs) have transformed AI research thanks to their powerful internal capabilities and knowledge. However, existing LLMs still fail to effectively incorporate the massive external knowledge when interacting with the world. Although retrieval-augmented LLMs are proposed to mitigate the issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve top-K raw data chunks from the external knowledge base which often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage its intermediate responses generated when solving past user queries (thoughts), filtering meaningless and redundant thoughts, organizing them in thought memory, and retrieving the relevant thoughts when addressing new queries. This effectively equips LLM-based agents with a self-evolving long-term memory that grows more capable through continuous interaction. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers. Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.

69. 【2604.12229】HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

Link: https://arxiv.org/abs/2604.12229

Authors: Jawad Hossain, Xiangyu Guo, Jiawei Zhou, Chong Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: maintain long chains, Small language models, Small language, struggle with complex, due to limited

Notes: 15 pages, 5 figures, Preprint

Abstract:Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs, via hint generation and reasoning, offers an effective and lightweight mechanism for enhancing mathematical reasoning.

70. 【2604.12227】Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

Link: https://arxiv.org/abs/2604.12227

Authors: Xiuxiu Tang, G. Alex Ambrose, Ying Cheng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: creating substantial variation, combine symbolic expressions, Student responses, symbolic expressions, creating substantial

Abstract:Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students' reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.

71. 【2604.12223】LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines

Link: https://arxiv.org/abs/2604.12223

Authors: Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Tsetlin Machine, BERT provide strong, lack semantic generalization, strong semantic representations, provide strong semantic

Notes: Accepted to Findings of the Association for Computational Linguistics (ACL 2026)

Abstract:Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.

72. 【2604.12216】TimeMark: A Trustworthy Time Watermarking Framework for Exact Generation-Time Recovery from AIGC

Link: https://arxiv.org/abs/2604.12216

Authors: Shangkun Che, Silin Du, Ge Gao

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, raised increasing concerns, Language Models, raised increasing

Abstract:The widespread use of Large Language Models (LLMs) in text generation has raised increasing concerns about intellectual property disputes. Watermarking techniques, which embed meta information into AI-generated content (AIGC), have the potential to serve as judicial evidence. However, existing methods rely on statistical signals in token distributions, leading to inherently probabilistic detection and reduced reliability, especially in multi-bit encoding (e.g., timestamps). Moreover, such methods introduce detectable statistical patterns, making them vulnerable to forgery attacks and enabling model providers to fabricate arbitrary watermarks. To address these issues, we propose the concept of trustworthy watermark, which achieves reliable recovery with 100% identification accuracy while resisting both user-side statistical attacks and provider-side forgery. We focus on trustworthy time watermarking for use as judicial evidence. Our framework integrates cryptographic techniques and encodes time information into time-dependent secret keys under regulatory supervision, preventing arbitrary timestamp fabrication. The watermark payload is decoupled from time and generated as a random, non-stored bit sequence for each instance, eliminating statistical patterns. To ensure verifiability, we design a two-stage encoding mechanism, which, combined with error-correcting codes, enables reliable recovery of generation time with theoretically perfect accuracy. Both theoretical analysis and experiments demonstrate that our framework satisfies the reliability requirements for judicial evidence and offers a practical solution for future AIGC-related intellectual property disputes.

73. 【2604.12210】Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering

Link: https://arxiv.org/abs/2604.12210

Authors: Weikang Zhang, Zimo Zhu, Zhichuan Yang, Chen Huang, Wenqiang Lei, See-Kiong Ng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Simulating Standardized Patients, Simulating Standardized, Standardized Patients, cognitive impairment offers, offers a scalable

Notes: Findings of ACL 2026

Abstract:Simulating Standardized Patients with cognitive impairment offers a scalable and ethical solution for clinical training. However, existing methods rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels. To address this limitation, we propose StsPatient for the fine-grained simulation of cognitively impaired patients. We innovatively capture domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses. Furthermore, we introduce a Stochastic Token Modulation (STM) mechanism to regulate the intervention probability. STM enables precise control over impairment severity while mitigating the instability of conventional vector methods. Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.

74. 【2604.12196】Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

Link: https://arxiv.org/abs/2604.12196

Authors: Manh Nguyen, Sunil Gupta, Hung Le

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, frequently generate multiple, frequently generate, remains challenging

Abstract:Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
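
The selection rule described in this abstract can be illustrated with a minimal sketch. It assumes plain Euclidean embeddings, where the weighted Fréchet mean reduces to the weighted arithmetic mean; the paper may use a different geometry or embedding model. The function name `radial_consensus_rank` and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def radial_consensus_rank(embeddings, weights=None):
    """Rank candidates by distance to the weighted mean of their embeddings.

    Assumption: Euclidean geometry, so the weighted Frechet mean is simply
    the weighted arithmetic mean of the embedding vectors.
    """
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    # Uniform weights by default; frequency- or probability-based weights
    # could be passed in instead.
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    center = w @ X                              # semantic center
    radii = np.linalg.norm(X - center, axis=1)  # radial distance per candidate
    order = np.argsort(radii, kind="stable")    # closest candidate first
    return order, radii

# Toy 2-D "embeddings": three answers that roughly agree and one outlier.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-1.0, 0.5]]
order, radii = radial_consensus_rank(toy)
best = int(order[0])
```

With uniform weights, the candidate nearest the center is selected and the outlier is ranked last. Passing non-uniform `weights` shifts the center toward answers that are more frequent or that the model considers more probable.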

75. 【2604.12195】Representing expertise accelerates learning from pedagogical interaction data

Link: https://arxiv.org/abs/2604.12195

Authors: Dhara Yu, Karthikeya Kaushik, Bill D. Thompson

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Work in cognitive, exposing learning agents, cognitive science, science and artificial, artificial intelligence

Abstract:Work in cognitive science and artificial intelligence has suggested that exposing learning agents to traces of interaction between multiple individuals can improve performance in a variety of settings, yet it remains unknown which features of interactions contribute to this improvement. We examined the factors that support the effectiveness of interaction data, using a controlled paradigm that allowed us to precisely operationalize key distinctions between interaction and an expert acting alone. We generated synthetic datasets of simple interactions between an expert and a novice in a spatial navigation task, and then trained transformer models on those datasets, evaluating performance after exposure to different datasets. Our experiments showed that models trained on pedagogical interactions were more robust across a variety of scenarios compared to models trained only on expert demonstrations, and that having the ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.

76. 【2604.12185】Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

Link: https://arxiv.org/abs/2604.12185

Authors: Keshu Wu, Chenchen Kuai, Zihao Li, Jiwan Jiang, Shiyu Shen, Shian Wang, Chan-Wei Hu, Zhengzhong Tu, Yang Zhou

Categories: Computation and Language (cs.CL)

Keywords: enhances large language, Retrieval-augmented generation, large language models, enhances large, large language

Abstract:Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.

77. 【2604.12179】AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Link: https://arxiv.org/abs/2604.12179

Authors: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large Language Models, Language Models, Large Language, memories remain difficult, remain difficult due

Notes: 13 pages, 5 figures, 5 tables

Abstract:Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded Question Answer (QA) pairs drawn from short- and long-term conversational histories. We also generated a new dataset entitled, TopicGuidedChat (TGC), where long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations depict that AgenticAI-DialogGen yields higher conversational quality and LLMs fine-tuned on TGC dataset achieve improved performance on memory-grounded QA tasks.

78. 【2604.12177】Policy-Invisible Violations in LLM-Based Agents

Link: https://arxiv.org/abs/2604.12177

Authors: Jie Wu, Ming Gong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: correct policy judgment, syntactically valid, decision time, facts needed, needed for correct

Notes: 26 pages, 1 figure, 11 tables

Abstract:LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.

79. 【2604.12162】AlphaEval: Evaluating Agents in Production

Link: https://arxiv.org/abs/2604.12162

Authors: Pengrui Lu, Bingyu Xu, Wenjun Zhang, Shengjia Hua, Xuanjian Gao, Ranxiang Ge, Lyumanshan Ye, Linxuan Wu, Yiran Li, Junfei Fish Yu, Yibo Zhang, Ruixin Li, Manxiang Li, Xiao Han, Xiaocong Zhou, Guangyao Chi, Zisheng Chen, Kaishen Chen, Kun Wang, Qihua Xu, Fengyue Meng, Yuchen Ni, Jiajun Li, Jinxiu Liu, Danfeng Zhang, Jingru Zhao, Pengfei Liu

Categories: Computation and Language (cs.CL)

Keywords: reflect production realities, Occupational Information Network, rapid deployment, settings has outpaced, outpaced the development

Abstract:The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics -- conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products -- Claude Code, Codex, etc. -- as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework -- a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.

80. 【2604.12147】From Plan to Action: How Well Do Agents Follow the Plan?

Link: https://arxiv.org/abs/2604.12147

Authors: Shuyang Liu, Saman Dehghan, Jatin Ganhotra, Martin Hirzel, Reyhaneh Jabbarvand

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: task-specific prompt crafting, crafting through autonomous, plan, aspire to eliminate, prompt crafting

Abstract:Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software issues following phases for navigation, reproduction, patch, and validation. Unfortunately, it is unknown to what extent agents actually follow such instructed plans. Without such an analysis, determining the extent agents comply with a given plan, it is impossible to assess whether a solution was reached through correct strategic reasoning or through other means, e.g., data contamination or overfitting to a benchmark. This paper presents the first extensive, systematic analysis of plan compliance in programming agents, examining 16,991 trajectories from SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations. Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. Providing the standard plan improves issue resolution, and we observe that periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Surprisingly, augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model's internal problem-solving strategy. These findings highlight a research gap: fine-tuning paradigms that teach models to follow instructed plans, rather than encoding task-specific plans in them. This requires teaching models to reason and act adaptively, rather than memorizing workflows.

81. 【2604.12138】Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.12138

Authors: Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: LLMs access external, current implementations exhibit, prioritize objective retrieval, access external knowledge, prioritize objective

Notes: 13 pages, Preprint under review

Abstract:RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias - treating opinions and diverse perspectives as noise rather than information to be synthesized - limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.

82. 【2604.12128】When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

Link: https://arxiv.org/abs/2604.12128

Authors: Ji Ho Bae

Categories: Computation and Language (cs.CL)

Keywords: internal matrix dynamics, self-referential inputs alter, large language models, inputs alter, alter the internal

Notes: 14 pages, 4 figures, 11 tables

Abstract:We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families -- Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B -- over 300 prompts in a 14-level hierarchy at three temperatures ($T \in \{0.0, 0.3, 0.7\}$), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) -- truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank -- indicating attention reorganization with global dispersion rather than simple concentration collapse -- and key metrics reach Cohen's $d = 3.14$ (attention effective rank) to $3.52$ (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction ($q < 0.05$), 198 with $|d| > 0.8$. Per-layer SVD confirms disruption at every sampled layer ($d > +1.0$ in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC $0.81$-$0.90$; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models. We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output ($+34$-$56$ percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.

83. 【2604.12126】Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

Link: https://arxiv.org/abs/2604.12126

Authors: Rongzhe Wei, Ge Shi, Min Cheng, Na Zhang, Pan Li, Sarthak Ghosh, Vaibhav Gorde, Leman Akoglu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, enabling autonomous reasoning, Large Language, significantly advanced tool-augmented

Notes: This work was completed during an internship at Amazon

Abstract:Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.
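
As a rough illustration of the idea of expanding more branches where predictive entropy is high, the sketch below computes the Shannon entropy of a next-tool distribution and widens the search only when it exceeds a threshold. This is not the paper's EGB algorithm: the function names, the `threshold` and `max_branches` parameters, and the toy tool names are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_branches(tool_probs, threshold=0.5, max_branches=3):
    """Pick which candidate tools to expand at one planning step.

    Hypothetical rule: follow only the top tool when the model is confident,
    and expand up to `max_branches` tools when entropy exceeds `threshold`.
    """
    ranked = sorted(tool_probs.items(), key=lambda kv: kv[1], reverse=True)
    h = entropy(tool_probs.values())
    k = max_branches if h > threshold else 1
    return [name for name, _ in ranked[:k]]

# A confident step (low entropy) and an uncertain step (high entropy).
confident = {"search": 0.90, "cart": 0.05, "pay": 0.05}
uncertain = {"search": 0.40, "cart": 0.35, "pay": 0.25}

narrow = choose_branches(confident)
wide = choose_branches(uncertain)
```

The confident step keeps a single branch, while the uncertain step expands all three candidates in order of probability. A fixed threshold is the simplest possible policy; a real system would presumably also budget the total number of expansions.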

84. 【2604.12099】The Effect of Document Selection on Query-focused Text Analysis

Link: https://arxiv.org/abs/2604.12099

Authors: Sandesh S Rangreji, Mian Zhong, Anjalie Field

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: computational constraints preclude, constraints preclude analyzing, selection strategy choices, strategy choices, document collections

Abstract:Analyses of document collections often require selecting what data to analyze, as not all documents are relevant to a particular research question and computational constraints preclude analyzing all documents, yet little work has examined effects of selection strategy choices. We systematically evaluate seven selection methods (from random selection to hybrid retrieval) on outputs from four text analyses methods (LDA, BERTopic, TopicGPT, HiCode) over two datasets with 26 open-ended queries. Our evaluation reveals practice guidance: semantic or hybrid retrieval offer strong go-to approaches that avoid the pitfalls of weaker selection strategies and the unnecessary compute overhead of more complicated ones. Overall, our evaluation framework establishes data selection as a methodological decision, rather than a practical necessity, inviting the development of new strategies.

85. 【2604.12097】Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

Link: https://arxiv.org/abs/2604.12097

Authors: Zhanwei Cao, YeoJin Go, Yifan Hu, Shanu Sushmita

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, generating responses independently, language models, generating responses

Notes: 25 pages, 6 figures. To appear in Findings of ACL 2026

Abstract:Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors' styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012--2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.

86. 【2604.12076】Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

Link: https://arxiv.org/abs/2604.12076

Authors: Syed Rifat Raiyan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Identifiable Victim Effect, facing equivalent hardship, allocate greater resources, statistically characterized group, characterized group facing

Notes: Under review, 49 pages, 20 figures, 11 tables

Abstract:The Identifiable Victim Effect (IVE) $-$ the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship $-$ is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments $-$ porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005) $-$ we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen's d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d$\approx$0.10) reported by Lee and Feeley (2016) $-$ and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting $-$ contrary to its role as a deliberative corrective $-$ nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.

87. 【2604.12069】Robust Explanations for User Trust in Enterprise NLP Systems

Link: https://arxiv.org/abs/2604.12069

Authors: Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun, Jerry Ting

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: real user noise, existing studies provide, studies provide limited, provide limited guidance, enterprise NLP

Abstract:Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.
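
The two ingredients named in this abstract, leave-one-out occlusion and top-token flip rate, can be sketched as follows. This is a simplified reading under stated assumptions: the black-box model is reduced to a scoring function over tokens, a "flip" is counted when the most important token differs by identity between the clean and perturbed input, and `toy_score` with its `WEIGHTS` table is a hypothetical stand-in for an API call. The paper's exact protocol may differ.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[Sequence[str]], float]

def top_token(tokens: Sequence[str], score: Scorer) -> str:
    """Return the token whose removal lowers the black-box score the most."""
    base = score(tokens)
    # Leave-one-out occlusion: re-score the input with each token removed.
    drops = [base - score(list(tokens[:i]) + list(tokens[i + 1:]))
             for i in range(len(tokens))]
    best_index = max(range(len(tokens)), key=drops.__getitem__)
    return tokens[best_index]

def flip_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]],
              score: Scorer) -> float:
    """Fraction of (clean, perturbed) pairs whose top token changes."""
    flips = sum(top_token(clean, score) != top_token(noisy, score)
                for clean, noisy in pairs)
    return flips / len(pairs)

# Hypothetical stand-in for a model API: a weighted keyword score.
WEIGHTS = {"great": 2.0, "good": 1.0}

def toy_score(tokens: Sequence[str]) -> float:
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

pairs = [
    # Word swap: the top token stays "great", so no flip.
    (["the", "movie", "was", "great"], ["the", "movie", "great", "was"]),
    # Deletion of "great": the top token becomes "good", so a flip.
    (["good", "but", "great"], ["good", "but"]),
]
rate = flip_rate(pairs, toy_score)
```

Because the explainer only calls `score`, the same code works with any API-only model, at a cost of one extra call per token.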

88. 【2604.12056】LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

Link: https://arxiv.org/abs/2604.12056

Authors: Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: diffusion language models, autoregressive decoding pipeline, Block-wise diffusion language, language models, offering a promising

Comments: 16 pages, 11 figures, 6 tables

Abstract:Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LoSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.

89. 【2604.12049】Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs

Link: https://arxiv.org/abs/2604.12049

Authors: Shreeya Verma Kathuria, Nitin Mayande, Sharookh Daruwalla, Nitin Joglekar, Charles Weber

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Context Assessment Summary, Semantic Context Assessment, Large Language, enterprise-grade analytics

Comments:

Abstract:The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and sensitivity to noise that compromise their analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model's attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation. Experimental results using Gemini 2.0 Flash Lite across diverse datasets - including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews - demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM based summaries based on a high-precision, deterministic process for large-scale text categorization.

90. 【2604.12047】Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Link: https://arxiv.org/abs/2604.12047

Authors: Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: files are primarily, primarily intended, intended for human, human reading, automated PDF processing

Comments: 12 pages

Abstract:PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods for automated PDF processing, including the promising Retrieval-Augmented Generation (RAG) systems. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
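
For readers unfamiliar with what "chunking with varied overlap" means in a RAG pipeline, the sketch below shows one common fixed-size variant. It is an assumption-laden illustration: the function name `chunk_words`, word-level splitting, and the parameter names are hypothetical and do not reproduce any strategy evaluated in the paper.

```python
from typing import List

def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_words("a b c d e f g", chunk_size=4, overlap=2)
```

A larger overlap makes it less likely that a table row or sentence is cut in half at a chunk boundary, at the cost of indexing more redundant text.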

91. 【2604.12046】Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Link: https://arxiv.org/abs/2604.12046

Authors: Xin Liu, Lu Wang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, Large, language models, generation

Comments:

Abstract:Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims' correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.

92. 【2604.12033】Benchmarking Deflection and Hallucination in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.12033

Authors: Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, increasingly rely, Large, answer knowledge-intensive multimodal

Comments: Accepted to ACL 2026

Abstract:Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.

93. 【2604.12018】LLMs Struggle with Abstract Meaning Comprehension More Than Expected

Link: https://arxiv.org/abs/2604.12018

Authors: Hamoud Alhazmi, Jiachen Jiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Understanding abstract meanings, Understanding abstract, crucial for advanced, advanced language comprehension, Understanding

Comments:

Abstract:Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models' ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.

94. 【2604.12015】UCS: Estimating Unseen Coverage for Improved In-Context Learning

Link: https://arxiv.org/abs/2604.12015

Authors: Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: performance depends critically, existing selectors prioritize, selectors prioritize heuristic, prioritize heuristic notions, In-context learning

Comments: ACL 2026 Findings; 17 pages, 3 figures

Abstract:In-context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior motivated by the principle that a good demonstration set should expose the model to latent clusters unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model-consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good-Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also yielding insights into task- and model-level latent cluster distributions. Code is available at this https URL.
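
To make "estimating unrevealed clusters from the empirical frequency spectrum" concrete, here is a small sketch of the classical unsmoothed Good-Toulmin estimator. This is a deliberately simpler, swapped-in technique and not the smoothed estimator the paper uses. The function names and the toy cluster labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable

def frequency_spectrum(cluster_ids: Iterable[Hashable]) -> Counter:
    """Map r -> number of clusters observed exactly r times."""
    per_cluster = Counter(cluster_ids)    # cluster -> occurrence count
    return Counter(per_cluster.values())  # occurrence count -> number of clusters

def good_toulmin_unseen(cluster_ids: Iterable[Hashable], t: float = 1.0) -> float:
    """Estimate how many new clusters t*n further samples would reveal.

    Unsmoothed Good-Toulmin form: U(t) = -sum_{i>=1} (-t)^i * Phi_i,
    where Phi_i is the number of clusters seen exactly i times.
    Restricted to 0 <= t <= 1, where the alternating series is well behaved.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("this unsmoothed sketch only supports 0 <= t <= 1")
    spectrum = frequency_spectrum(cluster_ids)
    return -sum(((-t) ** i) * phi for i, phi in spectrum.items())

# Toy cluster assignments for a candidate demonstration subset (made up)
subset_clusters = ["a", "a", "b", "c", "c", "c", "d"]
unseen_estimate = good_toulmin_unseen(subset_clusters, t=1.0)
```

Many singleton clusters (a large Phi_1) push the estimate up, signalling that the subset has probably not yet covered the latent cluster space.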

95. 【2604.12002】Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Link: https://arxiv.org/abs/2604.12002

Authors: Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

Categories: Computation and Language (cs.CL)

Keywords: Current post-training methods, verifiable settings fall, Current post-training, verifiable settings, settings fall

Comments:

Abstract:Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.

96. 【2604.11996】Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

Link: https://arxiv.org/abs/2604.11996

Authors: Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: trust Large Language, Large Language Models, Large Language, trust Large, Language Models

Comments:

Abstract:Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: this https URL.
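
The filtering step described in this abstract (score only the top-K% most confident traces) can be sketched as follows. Several details are assumptions made for illustration: traces are represented as `(confidence, reasoning_score)` pairs, the kept scores are averaged, and the cutoff is rounded up. The paper may define confidence and aggregation differently.

```python
import math
from typing import List, Tuple

def filtered_reasoning_score(traces: List[Tuple[float, float]],
                             top_k_percent: float) -> float:
    """Mean reasoning score over the top_k_percent most confident traces."""
    if not traces:
        raise ValueError("need at least one trace")
    if not 0 < top_k_percent <= 100:
        raise ValueError("top_k_percent must be in (0, 100]")
    keep = max(1, math.ceil(len(traces) * top_k_percent / 100))
    ranked = sorted(traces, key=lambda t: t[0], reverse=True)  # most confident first
    return sum(score for _, score in ranked[:keep]) / keep

# Made-up (confidence, reasoning_score) pairs for four sampled traces
traces = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.6), (0.4, 0.3)]
frs_50 = filtered_reasoning_score(traces, top_k_percent=50)
plain_mean = filtered_reasoning_score(traces, top_k_percent=100)
```

With `top_k_percent=100` the function reduces to the naive average that the abstract argues against, which makes the two easy to compare on the same set of traces.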

97. 【2604.11970】INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Link: https://arxiv.org/abs/2604.11970

Authors: Somraj Gautam, Anathapindika Dravichi, Gaurav Harit

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Visual Question Answering, Bahasa Indonesia, Question Answering, real-world document images, Table Visual Question

Comments: Accepted in ACL 2026 (Findings)

Abstract:We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful) with one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B and LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset can be accessed on Hugging Face at: this https URL

98. 【2604.11950】AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

Link: https://arxiv.org/abs/2604.11950

Authors: Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, Lingming Zhang

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: remain static hypotheses, require manual validation, reports remain static, recent LLM-based agents, recent LLM-based

Comments:

Abstract:While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.

99. 【2604.11924】GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

Link: https://arxiv.org/abs/2604.11924

Authors: Jimin Mun, Chani Jung, Xuhui Zhou, Hyunwoo Kim, Maarten Sap

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: hold significant potential, transform scientific research, LLMs hold significant, hold significant, significant potential

Comments: 22 pages, 6 figures

Abstract:While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes: validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.

100. 【2604.11811】M$^\star$: Every Task Deserves Its Own Memory Harness

Link: https://arxiv.org/abs/2604.11811

Authors: Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia

Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language model, Large language, extended interactions, rely on specialized, accumulate and reuse

Comments: Preprint

Abstract:Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.

101. 【2604.11628】Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

Link: https://arxiv.org/abs/2604.11628

Authors: Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu, Kai Wang, Xiaofang Zhou, Yuxuan Liang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: long-term dialogue history, complex hierarchical summarization, manage long-term dialogue, Existing conversational memory, memory systems rely

Comments: 23 pages, 12 figures

Abstract:Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the Signal Sparsity Effect within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: Decisive Evidence Sparsity, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and Dual-Level Redundancy, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.

102. 【2604.10898】ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

Link: https://arxiv.org/abs/2604.10898

Authors: David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, generating long intermediate, require generating long, language models

Comments:

Abstract:Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that a multi-granularity KV selection enables more memory efficient decoding, especially for long output generation.

103. 【2509.22220】StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Link: https://arxiv.org/abs/2509.22220

Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Prevalent semantic speech, capture linguistic content, Prevalent semantic, semantic speech tokenizers, designed to capture

Comments: Accepted to ICLR 2026

Abstract:Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at this https URL.
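
The "bit-wise voting" idea in this abstract can be illustrated with a per-bit majority vote over the codes emitted by several branches. This is only a sketch of the general mechanism: representing each branch output as an integer code, requiring an odd number of branches, and the function name `bitwise_majority_vote` are all assumptions rather than details from the paper.

```python
from typing import Sequence

def bitwise_majority_vote(branch_codes: Sequence[int], num_bits: int) -> int:
    """Merge integer codes from an odd number of branches by per-bit majority."""
    if len(branch_codes) % 2 == 0:
        raise ValueError("use an odd number of branches so no bit can tie")
    merged = 0
    for bit in range(num_bits):
        ones = sum((code >> bit) & 1 for code in branch_codes)
        if ones * 2 > len(branch_codes):  # strict majority of branches set this bit
            merged |= 1 << bit
    return merged

# Three hypothetical branches disagree on single bits, e.g. because of noise
codes = [0b1011, 0b1001, 0b0011]
token = bitwise_majority_vote(codes, num_bits=4)
```

A single noisy branch can flip individual bits of its own code without changing the merged token, which is the intuition behind consensus-based stability.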

Information Retrieval

1. 【2604.12990】Sparse Contrastive Learning for Content-Based Cold Item Recommendation

Link: https://arxiv.org/abs/2604.12990

Authors: Gregor Meehan, Johan Pauwels

Categories: Information Retrieval (cs.IR)

Keywords: recommender systems, collaborative filtering, pervasive challenge, challenge for collaborative, cold-start

Comments: Accepted at SIGIR 2026

Abstract:Item cold-start is a pervasive challenge for collaborative filtering (CF) recommender systems. Existing methods often train cold-start models by mapping auxiliary item content, such as images or text descriptions, into the embedding space of a CF model. However, such approaches can be limited by the fundamental information gap between CF signals and content features. In this work, we propose to avoid this limitation with purely content-based modeling of cold items, i.e. without alignment with CF user or item embeddings. We instead frame cold-start prediction in terms of item-item similarity, training a content encoder to project into a latent space where similarity correlates with user preferences. We define our training objective as a sparse generalization of sampled softmax loss with the $\alpha$-entmax family of activation functions, which allows for sharper estimation of item relevance by zeroing gradients for uninformative negatives. We then describe how this Sampled Entmax for Cold-start (SEMCo) training regime can be extended via knowledge distillation, and show that it outperforms existing cold-start methods and standard sampled softmax in ranking accuracy. We also discuss the advantages of purely content-based modeling, particularly in terms of equity of item outcomes.
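
The sparsity property mentioned in this abstract is easiest to see in sparsemax, which is the $\alpha = 2$ member of the $\alpha$-entmax family ($\alpha = 1$ recovers softmax). The sketch below implements only that activation, not the paper's SEMCo loss or its sampling scheme, and the example logits are made up.

```python
from typing import List

def sparsemax(logits: List[float]) -> List[float]:
    """Euclidean projection of logits onto the probability simplex (entmax, alpha=2)."""
    ordered = sorted(logits, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # Support condition: the k-th largest logit stays positive after thresholding
        if 1 + k * value > cumulative:
            tau = (cumulative - 1) / k
    return [max(z - tau, 0.0) for z in logits]

# One positive item and two negatives; the clearly worse negative gets exactly zero
probs = sparsemax([1.0, 0.8, -1.0])
peaked = sparsemax([2.0, 1.0, 0.1])
```

Unlike softmax, low-scoring entries receive exactly zero probability, which is what lets a loss built on this family stop sending gradient to uninformative negatives.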

2. 【2604.12965】Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation

Link: https://arxiv.org/abs/2604.12965

Authors: Dongqi Fu,Kaushik Rangadurai,Haiyu Lu,Yunchen Pu,Siyang Yuan,Minhui Huang,Yiqun Liu,Golnaz Ghasemiesfeh,Xingfeng He,Fangzhou Xu,Andrew Cui,Vidhoon Viswanathan,Lin Yang,Liang Wang,Jiyan Yang,Chonglin Sun

Categories: Information Retrieval (cs.IR)

Keywords: numerous large-scale industrial, foundational retrieval models, large-scale industrial retrieval, computational resources, retrieval models

Comments: 11 pages, 5 figures

Abstract:The increase in data volume, computational resources, and model parameters during training has led to the development of numerous large-scale industrial retrieval models for recommendation tasks. However, effectively and efficiently deploying these large-scale foundational retrieval models remains a critical challenge that has not been fully addressed. Common quick-win solutions for deploying these massive models include relying on offline computations (such as cached user dictionaries) or distilling large models into smaller ones. Yet, both approaches fall short of fully leveraging the representational and inference capabilities of foundational models. In this paper, we explore whether it is possible to learn a hierarchical organization over the memory of foundational retrieval models. Such a hierarchical structure would enable more efficient search by reducing retrieval costs while preserving exactness. To achieve this, we propose jointly learning a hierarchical index using cross-attention and residual quantization for large-scale retrieval models. We also present its real-world deployment at Meta, supporting daily advertisement recommendations for billions of Facebook and Instagram users. Interestingly, we discovered that the intermediate nodes in the learned index correspond to a small set of high-quality data. Fine-tuning the model on this set further improves inference performance and concretizes the concept of "test-time training" within the recommendation system domain. We demonstrate these findings using both internal and public datasets with strong baseline comparisons and hope they contribute to the community's efforts in developing the next generation of foundational retrieval models.

3. 【2604.12471】Beyond Single-Dimension Novelty: How Combinations of Theory, Method, and Results-based Novelty Shape Scientific Impact

Link: https://arxiv.org/abs/2604.12471

Authors: Yi Zhao,Yang Chenggang,Yuzhuo Wang,Tong Bao,Zhang Heng,Chengzhi Zhang

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: novelty, scientific impact, Scientific novelty drives, Scientific, research frontier

Comments: AII-EEKE 2026

4. 【2604.12372】Is Sliding Window All You Need? An Open Framework for Long-Sequence Recommendation

Link: https://arxiv.org/abs/2604.12372

Authors: Sayak Chakrabarty,Souradip Pal

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: Long interaction histories, modern recommender systems, Long interaction, long sequences, recommender systems

Comments: 8 pages, 2 figures

Abstract:Long interaction histories are central to modern recommender systems, yet training with long sequences is often dismissed as impractical under realistic memory and latency budgets. This work demonstrates that it is not only practical but also effective at academic scale. We release a complete, end-to-end framework that implements industrial-style long-sequence training with sliding windows, including all data processing, training, and evaluation scripts. Beyond reproducing prior gains, we contribute two capabilities missing from earlier reports: (i) a runtime-aware ablation study that quantifies the accuracy-compute frontier across windowing regimes and strides, and (ii) a novel k-shift embedding layer that enables million-scale vocabularies on commodity GPUs with negligible accuracy loss. Our implementation trains reliably on modest university clusters while delivering competitive retrieval quality (e.g., up to +6.04% MRR and +6.34% Recall@10 on Retailrocket) with $\sim 4 \times$ training-time overheads. By packaging a robust pipeline, reporting training time costs, and introducing an embedding mechanism tailored for low-resource settings, we transform long-sequence training from a closed, industrial technique into a practical, open, and extensible methodology for the community.
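
As a rough illustration of what "windowing regimes and strides" can mean in practice, the sketch below cuts one user's interaction history into fixed-length windows. The abstract does not specify the framework's exact windowing rule, so the function name, the handling of short histories, and the choice to always include the most recent window are assumptions.

```python
from typing import List, Sequence


def sliding_windows(history: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Cut one interaction history into training windows of fixed length."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(history) <= window:
        return [list(history)]
    last_start = len(history) - window
    starts = list(range(0, last_start + 1, stride))
    # Assumption: always keep the most recent window, even if the stride skips it.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [list(history[s:s + window]) for s in starts]


if __name__ == "__main__":
    item_ids = list(range(10))  # hypothetical item IDs, oldest first
    for w in sliding_windows(item_ids, window=4, stride=3):
        print(w)
```

A smaller stride yields more, heavily overlapping windows per user, which is one source of the accuracy-versus-training-time trade-off the abstract refers to.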

5. 【2604.12298】Deep Situation-Aware Interaction Network for Click-Through Rate Prediction

Link: https://arxiv.org/abs/2604.12298

Authors: Yimin Lv,Shuli Wang,Beihong Jin,Yisong Yu,Yapeng Zhang,Jian Dong,Yongkang Wang,Xingxing Wang,Dong Wang

Categories: Information Retrieval (cs.IR)

Keywords: Click-Through Rate, sequence modeling plays, behavior sequence modeling, User behavior sequence, user behavior sequences

Comments: RecSys'23 Full Paper

Abstract:User behavior sequence modeling plays a significant role in Click-Through Rate (CTR) prediction on e-commerce platforms. Besides the interacted items, user behaviors contain rich interaction information, such as the behavior type, time, location, etc. However, so far, the information related to user behaviors has not yet been fully exploited. In this paper, we propose the concept of a situation and situational features for distinguishing interaction behaviors and then design a CTR model named Deep Situation-Aware Interaction Network (DSAIN). DSAIN first adopts the reparameterization trick to reduce noise in the original user behavior sequences. Then it learns the embeddings of situational features by feature embedding parameterization and tri-directional correlation fusion. Finally, it obtains the embedding of behavior sequence via heterogeneous situation aggregation. We conduct extensive offline experiments on three real-world datasets. Experimental results demonstrate the superiority of the proposed DSAIN model. More importantly, DSAIN has increased the CTR by 2.70%, the CPM by 2.62%, and the GMV by 2.16% in the online A/B test. Now, DSAIN has been deployed on the Meituan food delivery platform and serves the main traffic of the Meituan takeout app.

6. 【2604.12234】UniRec: Bridging the Expressive Gap between Generative and Discriminative Recommendation via Chain-of-Attribute

Link: https://arxiv.org/abs/2604.12234

Authors: Ziliang Wang,Gaoyun Lin,Xuesi Wang,Shaoqiang Liang,Yili Huang,Weijie Bian

Categories: Information Retrieval (cs.IR)

Keywords: Semantic IDs, Generative Recommendation, reframes retrieval, unifying the multi-stage, multi-stage pipeline

Abstract:Generative Recommendation (GR) reframes retrieval and ranking as autoregressive decoding over Semantic IDs (SIDs), unifying the multi-stage pipeline into a single model. Yet a fundamental expressive gap persists: discriminative models score items with direct feature access, enabling explicit user-item crossing, whereas GR decodes over compact SID tokens without item-side signal. We formalize this via Bayes' theorem, showing ranking by p(y|f,u) is equivalent to ranking by p(f|y,u), which factorizes autoregressively over item features. This establishes that a generative model with full feature access is as expressive as its discriminative counterpart; any practical gap stems solely from incomplete feature coverage. We propose UniRec with Chain-of-Attribute (CoA) as its core mechanism. CoA prefixes each SID sequence with structured attribute tokens--category, seller, brand--before decoding the SID itself, recovering the item-side feature crossing that discriminative models exploit. Because items sharing identical attributes cluster in adjacent SID regions, attribute conditioning yields a measurable per-step entropy reduction $H(s_k \mid s_{<k}, a) < H(s_k \mid s_{<k})$, narrowing the search space and stabilizing beam search trajectories. We further address two deployment challenges: Capacity-constrained SID introduces exposure-weighted capacity penalties into residual quantization to suppress token collapse and the Matthew effect across SID layers; Conditional Decoding Context (CDC) combines Task-Conditioned BOS with hash-based Content Summaries, injecting scenario-conditioned signals at each decoding step. A joint RFT and DPO framework aligns the model with business objectives beyond distribution matching. Experiments show UniRec outperforms the strongest baseline by +22.6% HR@50 overall and +15.5% on high-value orders, with online A/B tests confirming significant business metric gains.

7. 【2604.12231】Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems

Link: https://arxiv.org/abs/2604.12231

Authors: Tao Feng,Pengrui Han,Guanyu Lin,Ge Liu,Jiaxuan You

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, powerful internal capabilities, Large language, language models, transformed AI research

8. 【2604.12201】AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

Link: https://arxiv.org/abs/2604.12201

Authors: Hongru Song,Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Yixing Fan,Xueqi Cheng

Categories: Information Retrieval (cs.IR)

Keywords: enhances large language, large language model, Retrieval-augmented generation, retrieving external documents, enhances large

Comments: 6 pages, accepted by SIGIR 2026 short paper

Abstract:Retrieval-augmented generation (RAG) enhances large language model (LLM) reasoning by retrieving external documents, but also opens up new attack surfaces. We study knowledge-base poisoning attacks in RAG, where an attacker injects malicious content into the retrieval corpus, which is then naturally surfaced by the retriever and consumed by the LLM during reasoning. Unlike prior work that floods the corpus with poisoned documents, we propose AdversarialCoT, a query-specific attack that poisons only a single document in the corpus. AdversarialCoT first extracts the target LLM's reasoning framework to guide the construction of an initial adversarial chain-of-thought (CoT). The adversarial document is iteratively refined through interactions with the LLM, progressively exposing and exploiting critical reasoning vulnerabilities. Experiments on benchmark LLMs show that a single adversarial document can significantly degrade reasoning accuracy, revealing subtle yet impactful weaknesses. This study exposes security risks in RAG systems and provides actionable insights for designing more robust LLM reasoning pipelines.

9. 【2604.12179】AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Link: https://arxiv.org/abs/2604.12179

Authors: Manoj Madushanka Perera,Adnan Mahmood,Kasun Eranda Wijethilake,Quan Z. Sheng

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large Language Models, Language Models, Large Language, memories remain difficult, remain difficult due

Comments: 13 pages, 5 figures, 5 tables

10. 【2604.12138】Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.12138

Authors: Aditya Agrawal,Alwarappan Nakkiran,Darshan Fofadiya,Alex Karlsson,Harsha Aduri

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: LLMs access external, current implementations exhibit, prioritize objective retrieval, access external knowledge, prioritize objective

Comments: 13 pages, Preprint under review

11. 【2604.12099】The Effect of Document Selection on Query-focused Text Analysis

Link: https://arxiv.org/abs/2604.12099

Authors: Sandesh S Rangreji,Mian Zhong,Anjalie Field

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: computational constraints preclude, constraints preclude analyzing, selection strategy choices, strategy choices, document collections

12. 【2604.12047】Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Link: https://arxiv.org/abs/2604.12047

Authors: Omar El Bachyr,Yewei Song,Saad Ezzini,Jacques Klein,Tegawendé F. Bissyandé,Anas Zilali,Ulrick Ble,Anne Goujon

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: files are primarily, primarily intended, intended for human, human reading, automated PDF processing

Comments: 12 pages

13. 【2604.12036】Constant-Factor Approximation for the Uniform Decision Tree

Link: https://arxiv.org/abs/2604.12036

Authors: Michał Szyfelbein

Categories: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: uniform probability distribution, long-standing open question, constant-factor approximation algorithm, Decision Tree, resolve a long-standing

Comments: 10 pages

Abstract:We resolve a long-standing open question about the existence of a constant-factor approximation algorithm for the average-case Decision Tree problem with uniform probability distribution over the hypotheses. We answer the question in the affirmative by providing a simple polynomial-time algorithm with approximation ratio of $\frac{2}{1-\sqrt{(e+1)/(2e)}}+\epsilon < 11.57$. This improves upon the currently best-known, greedy algorithm which achieves $O(\log n/{\log\log n})$-approximation. The first key ingredient in our analysis is the usage of a decomposition technique known from problems related to Hierarchical Clustering [SODA '17, WALCOM '26], which allows us to decompose the optimal decision tree into a series of objects called separating subfamilies. The second crucial idea is to reduce the subproblem of finding a Separating Subfamily to an instance of the Maximum Coverage problem. To do so, we analyze the properties of cutting cliques into small pieces, which represent pairs of hypotheses to be separated. This allows us to obtain a good approximation for the Separating Subfamily problem, which then enables the design of the approximation algorithm for the original problem.
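
As a quick sanity check on the constant in the approximation ratio, the closed-form term $\frac{2}{1-\sqrt{(e+1)/(2e)}}$ can be evaluated numerically. The snippet below is only arithmetic on that expression; the variable names are ours.

```python
import math

# (e + 1) / (2e) is the quantity under the square root.
inner = (math.e + 1) / (2 * math.e)

# Constant part of the ratio, before the additive epsilon.
ratio = 2 / (1 - math.sqrt(inner))

print(f"(e+1)/(2e) = {inner:.6f}")
print(f"2 / (1 - sqrt((e+1)/(2e))) = {ratio:.4f}")
```

The constant comes out at roughly 11.56, so a sufficiently small $\epsilon$ keeps the overall ratio under the 11.57 figure quoted in the abstract.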

Computer Vision

1. 【2604.13036】Lyra 2.0: Explorable Generative 3D Worlds

Link: https://arxiv.org/abs/2604.13036

Authors: Tianchang Shen,Sherwin Bahmani,Kai He,Sangeetha Grama Srinivasan,Tianshi Cao,Jiawei Ren,Ruilong Li,Zian Wang,Nicholas Sharp,Zan Gojcic,Sanja Fidler,Jiahui Huang,Huan Ling,Jun Gao,Xuanchi Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, simulate scene walkthroughs, feed-forward reconstruction techniques, reconstruction techniques, generating camera-controlled videos

Comments: Project Page: [this https URL](https://research.nvidia.com/labs/sil/projects/lyra2/)

Abstract:Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.

2. 【2604.13035】SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Link: https://arxiv.org/abs/2604.13035

Authors: Kathakoli Sengupta,Kai Ao,Paola Cascante-Bonilla

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, increasingly generate indoor, Language Models, making judgments sensitive

Comments: Project Page: [this https URL](https://lab-spell.github.io/SceneCritic/)

3. 【2604.13030】Generative Refinement Networks for Visual Synthesis

Link: https://arxiv.org/abs/2604.13030

Authors: Jian Han,Jinlai Liu,Jiahuan Wang,Bingyue Peng,Zehuan Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: uniform computational effort, diffusion models dominate, computationally inefficient, applying a uniform, Generative Refinement Networks

Comments: code: [this https URL](https://github.com/MGenAI/GRN)

Abstract:While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ's latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks -- like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.

4. 【2604.13029】Visual Preference Optimization with Rubric Rewards

Link: https://arxiv.org/abs/2604.13029

Authors: Ya-Qi Yu,Fangyu Hong,Xiangyang Qu,Hao Wang,Gaojie Wu,Qiaoyu Luo,Nuo Xu,Huixin Wang,Wuheng Xu,Yongxin Liao,Zihao Chen,Haonan Li,Ziming Li,Dezhi Peng,Minghui Liao,Jihao Wu,Haoyu Ren,Dandan Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Direct Preference Optimization, effectiveness of Direct, Direct Preference, multimodal tasks, Preference Optimization

Abstract:The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.

5. 【2604.13028】Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns

Link: https://arxiv.org/abs/2604.13028

Authors: Baris Sarper Tezcan,Hrishikesh Viswanath,Rubab Saher,Daniel Aliaga

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: areas are increasingly, increasingly vulnerable, driven by rapid, rapid urbanization, thermal extremes driven

Comments: Accepted to the CVPR 2026 EarthVision Workshop

Abstract:Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.

6. 【2604.13021】Representation geometry shapes task performance in vision-language modeling for CT enterography

Link: https://arxiv.org/abs/2604.13021

Authors: Cristian Minoccheri,Emily Wittrup,Kayvan Najarian,Ryan Stidham

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: assessing inflammatory bowel, support automated analysis, Computed tomography, inflammatory bowel disease, assessing inflammatory

Abstract:Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7--14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80--0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
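
The multi-window RGB encoding described above can be sketched as follows. The abstract does not list the windows used, so the three (center, width) pairs below are hypothetical placeholders, as are the function names; the only idea taken from the text is that each Hounsfield Unit (HU) window is clipped, rescaled, and written to one colour channel.

```python
from typing import List, Sequence, Tuple

# Hypothetical (center, width) HU windows, one per RGB channel; not from the paper.
WINDOWS: Tuple[Tuple[float, float], ...] = ((40, 400), (50, 150), (400, 1800))


def window_to_uint8(hu: float, center: float, width: float) -> int:
    """Clip an HU value to a window and rescale it to the 0..255 range."""
    low = center - width / 2
    high = center + width / 2
    clipped = min(max(hu, low), high)
    return round(255 * (clipped - low) / (high - low))


def encode_slice(
    hu_slice: Sequence[Sequence[float]],
    windows: Sequence[Tuple[float, float]] = WINDOWS,
) -> List[List[Tuple[int, ...]]]:
    """Turn a 2D slice of HU values into per-pixel (R, G, B) tuples."""
    return [
        [tuple(window_to_uint8(v, c, w) for c, w in windows) for v in row]
        for row in hu_slice
    ]


if __name__ == "__main__":
    toy_slice = [[-1000, 0], [60, 3000]]  # air, water, soft tissue, dense material
    for row in encode_slice(toy_slice):
        print(row)
```

Each channel saturates at a different HU range, so a single RGB pixel carries contrast information that one window alone would clip away.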

7. 【2604.13019】See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Link: https://arxiv.org/abs/2604.13019

Authors: Himangi Mittal,Gaurav Mittal,Nelson Daniel Troncoso,Yu Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense IDE elements, executable screen actions, translate language instructions, graphical user interface, dense IDE

Abstract:Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: this https URL.

8. 【2604.12999】Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

Link: https://arxiv.org/abs/2604.12999

Authors: Jaywon Koo,Jefferson Hernandez,Ruozhen He,Hanjie Chen,Chen Wei,Vicente Ordonez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hypothesis-driven scientific inquiry, formulates neural architecture, scientific inquiry, visual recognition, hypothesis-driven scientific

Abstract:We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

9. 【2604.12978】GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Link: https://arxiv.org/abs/2604.12978

Authors: Amir Hossein Kargaran,Nafiseh Nikeghbal,Jana Diesner,François Yvon,Hinrich Schütze

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optical character recognition, Optical character, cluster of high, advanced rapidly, evaluation has remained

10. 【2604.12969】AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation

Link: https://arxiv.org/abs/2604.12969

Authors: Yubraj Bhandari,Lavsen Dahal,Paul Segars,Joseph Y. Lo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical imaging research, variations remain limited, clinically meaningful anatomical, imaging research, generate controlled

Abstract:Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver Dice $0.83 \pm 0.05$), stable single-organ calibration over $[-3,+3]$ VCS, and disentangled multi-organ modulation. To showcase clinical utility with a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces distributional distance of training data by 73.6%. These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.

11. 【2604.12968】Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Link: https://arxiv.org/abs/2604.12968

Authors: Tong Zhang,Jiangning Zhang,Zhucun Xue,Juntao Jiang,Yicheng Xu,Chengming Xu,Teng Hu,Xingyu Xie,Xiaobin Hu,Yabiao Wang,Yong Liu,Shuicheng Yan

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Balancing convergence speed, Balancing convergence, computational efficiency remains, generalization capability, convergence speed

Abstract:Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at this https URL.

12. 【2604.12966】Boosting Visual Instruction Tuning with Self-Supervised Guidance

Link: https://arxiv.org/abs/2604.12966

Authors: Sophia Sirko-Galouchenko,Monika Wysoczanska,Andrei Bursuc,Nicolas Thome,Spyros Gidaris

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, large language models, Multimodal, instruction tuning

Comments:

Click to view abstract

Abstract:Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: this https URL
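
To make the idea of recasting a pretext task as an image-instruction-response triplet concrete, here is a minimal sketch for rotation prediction. It is not the authors' code: the function names, the instruction wording, and the use of a nested list as a stand-in for an image are all assumptions made for illustration.

```python
import random

def rotate90(grid, k):
    """Rotate a 2D grid (list of rows) clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Reversing the rows and transposing gives one clockwise quarter turn.
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid

def make_rotation_triplet(image, rng):
    """Build one (image, instruction, response) example (hypothetical format)."""
    k = rng.randrange(4)  # pick 0, 90, 180, or 270 degrees
    return {
        "image": rotate90(image, k),
        "instruction": ("By how many degrees clockwise has this image been "
                        "rotated? Answer with 0, 90, 180, or 270."),
        "response": str(90 * k),
    }

# A tiny 2x2 "image" stands in for real pixel data.
image = [[1, 2], [3, 4]]
triplet = make_rotation_triplet(image, random.Random(0))
```

The label is produced by the transformation itself, so no human annotation is involved, and the question cannot be answered from the instruction text alone.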

13. 【2604.12945】Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks

Link: https://arxiv.org/abs/2604.12945

Authors: Amar Gahir,Varshil Patel,Shreyank N Gowda

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep neural networks, uniformly sampling large, sampling large datasets, samples contribute equally, Deep neural

Comments:

Click to view abstract

Abstract:Deep neural networks are typically trained by uniformly sampling large datasets across epochs, despite evidence that not all samples contribute equally throughout learning. Recent work shows that progressively reducing the amount of training data can improve efficiency and generalization, but existing methods rely on fixed schedules that do not adapt during training. In this work, we propose Adaptive Data Dropout, a simple framework that dynamically adjusts the subset of training data based on performance feedback. Inspired by self-regulated learning, our approach treats data selection as an adaptive process, increasing or decreasing data exposure in response to changes in training accuracy. We introduce a lightweight stochastic update mechanism that modulates the dropout schedule online, allowing the model to balance exploration and consolidation over time. Experiments on standard image classification benchmarks show that our method reduces effective training steps while maintaining competitive accuracy compared to static data dropout strategies. These results highlight adaptive data selection as a promising direction for efficient and robust training. Code will be released.
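
The abstract does not give the update rule, so the following is only a rough sketch of what a feedback-driven data schedule could look like. The direction of the update (less data after an accuracy gain, more otherwise), the step size, the jitter term, and all function names are assumptions, not the paper's method.

```python
import random

def update_keep_fraction(keep, prev_acc, acc, rng, step=0.05, jitter=0.01,
                         lo=0.1, hi=1.0):
    """Return the next fraction of the dataset to train on (assumed rule)."""
    # Assumption: if training accuracy improved, expose less data;
    # if it stalled or dropped, expose more.
    direction = -1.0 if acc > prev_acc else 1.0
    # Small random perturbation, standing in for a stochastic update.
    noise = rng.uniform(-jitter, jitter)
    return min(hi, max(lo, keep + direction * step + noise))

def select_subset(indices, keep, rng):
    """Uniformly sample the kept share of sample indices for the next epoch."""
    n = max(1, round(keep * len(indices)))
    return rng.sample(indices, n)

rng = random.Random(0)
keep = update_keep_fraction(0.5, prev_acc=0.60, acc=0.70, rng=rng)
subset = select_subset(list(range(100)), keep, rng)
```

A fixed schedule would compute `keep` from the epoch number alone; here it depends on the observed accuracy change, which is the adaptive element the abstract describes.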

14. 【2604.12944】Distorted or Fabricated? A Survey on Hallucination in Video LLMs

Link: https://arxiv.org/abs/2604.12944

Authors: Yiyang Huang,Yitian Zhang,Yizhou Wang,Mingyuan Zhang,Liang Shi,Huimin Zeng,Yun Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, Video Large

Comments: ACL 2026 findings

Click to view abstract

Abstract:Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at this https URL.

15. 【2604.12941】Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

Link: https://arxiv.org/abs/2604.12941

Authors: Tianshuo Zhang,Haoyuan Zhang,Siran Peng,Weisong Zhao,Xiangyu Zhu,Zhen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continual face forgery, learn emerging forgery, emerging forgery paradigms, requires detectors, previously seen manipulations

Comments:

Click to view abstract

Abstract:Continual face forgery detection (CFFD) requires detectors to learn emerging forgery paradigms without forgetting previously seen manipulations. Existing CFFD methods commonly rely on replaying a small amount of past data to mitigate forgetting. Such replay is typically implemented either by storing a few historical samples or by synthesizing pseudo-forgeries from detector-dependent perturbations. Under strict memory budgets, the former cannot adequately cover diverse forgery cues and may expose facial identities, while the latter remains strongly tied to past decision boundaries. We argue that the core role of replay in CFFD is to reinstate the distributions of previous forgery tasks during subsequent training. To this end, we directly condense the discrepancy between real and fake distributions and leverage real faces from the current stage to perform distribution-level replay. Specifically, we introduce Distribution-Discrepancy Condensation (DDC), which models the real-to-fake discrepancy via a surrogate factorization in characteristic-function space and condenses it into a tiny bank of distribution discrepancy maps. We further propose Manifold-Consistent Replay (MCR), which synthesizes replay samples through variance-preserving composition of these maps with current-stage real faces, yielding samples that reflect previous-task forgery cues while remaining compatible with current real-face statistics. Operating under an extremely small memory budget and without directly storing raw historical face images, our framework consistently outperforms prior CFFD baselines and significantly mitigates catastrophic forgetting. Replay-level privacy analysis further suggests reduced identity leakage risk relative to selection-based replay.

16. 【2604.12935】Task Alignment: A simple and effective proxy for model merging in computer vision

Link: https://arxiv.org/abs/2604.12935

Authors: Pau de Jorge,César Roberto de Souza,Björn Michele,Mert Bülent Sarıyıldız,Philippe Weinzaepfel,Florent Perronnin,Diane Larlus,Yannis Kalantidis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: great practical interest, Efficiently merging, pretrained base model, pretrained base, Efficiently

Comments:

Click to view abstract

Abstract:Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.

17. 【2604.12933】DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding

Link: https://arxiv.org/abs/2604.12933

Authors: Yuhan Jin,Nayari Marie Lessa,Mariela De Lucas Alvarez,Melvin Laux,Lucas Amparo Barbosa,Frank Kirchner,Rebecca Adam

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Marine ecosystem degradation, ecosystem degradation necessitates, scientifically selective underwater, selective underwater monitoring, Marine ecosystem

Comments:

Click to view abstract

Abstract:Marine ecosystem degradation necessitates continuous, scientifically selective underwater monitoring. However, most autonomous underwater vehicles (AUVs) operate as passive data loggers, capturing exhaustive video for offline review and frequently missing transient events of high scientific value. Transitioning to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes. We propose DINO-Explorer, a novelty-aware perception framework driven by a continuous semantic surprise signal. Operating within the latent space of a frozen DINOv3 foundation model, it leverages a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution. An efference-copy-inspired module utilizes globally pooled optical flow to discount self-induced visual changes without suppressing genuine environmental novelty. We evaluate this signal on the downstream task of asynchronous event triage under variant telemetry constraints. Results demonstrate that DINO-Explorer provides a robust, bandwidth-efficient attention mechanism. At a fixed operating point, the system retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate, effectively surfacing mission-relevant phenomena. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the validated peak F1 versus telemetry bandwidth frontier, reducing telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score, successfully concentrating data transmission around human-verified novelty events.

18. 【2604.12929】Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

Link: https://arxiv.org/abs/2604.12929

Authors: Ayce Idil Aytekin,Xu Chen,Zhengyang Shen,Thabo Beeler,Helge Rhodin,Rishabh Dabral,Christian Theobalt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present Grasp, single monocular video, reconstructing dynamic, Grasp in Gaussians, robust method

Comments: Project page: [this https URL](https://aidilayce.github.io/GraG-page/)

Click to view abstract

Abstract:We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand's per-joint position error by over 65%.

19. 【2604.12923】Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Link: https://arxiv.org/abs/2604.12923

Authors: Sravan Chittupalli,Ayush Jain,Dong Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Resolving real-world human-object, disentangling fine-grained concurrent, fine-grained concurrent physical, Resolving real-world, real-world human-object interactions

Comments:

Click to view abstract

Abstract:Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

20. 【2604.12918】Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Link: https://arxiv.org/abs/2604.12918

Authors: Ahmet İnanç,Özgür Erkent

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: physical coordinate system, unified spatial canvas, perception in autonomous, autonomous driving, coordinate system

Comments: 8 pages, 5 figures, 3 Tables, submitted to a venue for consideration

Click to view abstract

Abstract:Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.

21. 【2604.12917】M3D-Stereo: A Multiple-Medium and Multiple-Degradation Dataset for Stereo Image Restoration

Link: https://arxiv.org/abs/2604.12917

Authors: Deqing Yang,Yingying Liu,Qicong Wang,Zhi Zeng,Dajiang Lu,Yibin Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe information loss, highly challenging problem, challenging problem due, adverse conditions, remains a highly

Comments:

Click to view abstract

Abstract:Image restoration under adverse conditions, such as underwater, haze or fog, and low-light environments, remains a highly challenging problem due to complex physical degradations and severe information loss. Existing datasets are predominantly limited to a single degradation type or heavily rely on synthetic data without stereo consistency, inherently restricting their applicability in real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset with 7904 high-resolution image pairs for image restoration research acquired in multiple media with multiple controlled degradation levels. It encompasses four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset, and is divided into six levels of progressive degradation, allowing fine-grained evaluations of restoration methods with increasing severity of degradation. Collected via a laboratory setup, the dataset provides aligned stereo image pairs along with their pixel-wise consistent clear ground truths. Two restoration tasks, single-level and mixed-level degradation, were performed to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark to evaluate image restoration and stereo matching methods in complex degradation environments. It is made public under LGPLv3 license.

22. 【2604.12904】A Sanity Check on Composed Image Retrieval

Link: https://arxiv.org/abs/2604.12904

Authors: Yikun Liu,Jiangchao Yao,Weidi Xie,Yanfeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Composed Image Retrieval, target image based, Image Retrieval, Composed Image, query composed

Comments:

Click to view abstract

Abstract:Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image, and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries degrading the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria), and have not considered their effectiveness in the context of the multi-round system. Motivated by this, we consider improving the evaluation procedure from two aspects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of the existing models in the interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons prove the value of our novel evaluation on typical CIR methods.

23. 【2604.12896】Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Link: https://arxiv.org/abs/2604.12896

Authors: Muhammad Kamran Janjua,Hugo Silva,Di Niu,Bahador Rashidi

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multimodal language models, Multimodal language, increasingly paired, enhance visual reasoning, Multimodal

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P$^2$), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P$^2$ consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P$^2$ raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P$^2$, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.

24. 【2604.12894】Representing 3D Faces with Learnable B-Spline Volumes

Link: https://arxiv.org/abs/2604.12894

Authors: Prashanth Chandran,Daoye Wang,Timo Bolkart

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Control-based Unified B-spline, Unified B-spline Encoding, Control-based Unified, combines B-spline volumes, B-spline Encoding

Comments: Accepted to CVPR 2026 (Highlight)

Click to view abstract

Abstract:We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model's expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.

25. 【2604.12890】Towards Long-horizon Agentic Multimodal Search

Link: https://arxiv.org/abs/2604.12890

Authors: Yifan Du,Zikang Liu,Jinbiao Peng,Jie Wu,Junyi Li,Jinyang Li,Wayne Xin Zhao,Ji-Rong Wen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shown great potential, Multimodal deep search, iteratively collecting textual, solving complex tasks, Multimodal deep

Comments:

Click to view abstract

Abstract:Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in this https URL.
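
The file-based mechanism described above can be illustrated with a small sketch: image bytes are written to disk, the agent's context keeps only a short textual UID, and a fetch tool loads the image back on demand. The class name `VisualStore`, the UID format, and the `.bin` file layout are hypothetical and not taken from LMM-Searcher.

```python
import tempfile
import uuid
from pathlib import Path

class VisualStore:
    """Hypothetical store that swaps image bytes for short text identifiers."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes):
        """Write the image to disk and return a lightweight UID for the context."""
        uid = f"img_{uuid.uuid4().hex[:8]}"
        (self.root / f"{uid}.bin").write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid):
        """Load the image back only when the agent asks for it."""
        return (self.root / f"{uid}.bin").read_bytes()

store = VisualStore(tempfile.mkdtemp())
uid = store.offload(b"\x89PNG fake image bytes")
# The context carries only the identifier, not the image payload.
context = [f"Search result 1 contains image <{uid}>"]
```

The point of the design is that the context grows by a few characters per image rather than by the image's full token cost, while the image itself stays retrievable for later turns.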

26. 【2604.12887】VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Link: https://arxiv.org/abs/2604.12887

Authors: Andrei Atanov,Jesse Allardice,Roman Bachmann,Oğuzhan Fatih Kar,R Devon Hjelm,David Griffiths,Peter Fu,Afshin Dehghan,Amir Zamir

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Visual tokenizers map, map high-dimensional raw, high-dimensional raw pixels, tokenizers map high-dimensional, Visual tokenizers

Comments: project page at [this https URL](https://videoflextok.epfl.ch/)

Click to view abstract

Abstract:Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

27. 【2604.12856】PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Link: https://arxiv.org/abs/2604.12856

Authors: Xuan Wang,Kai Ruan,Jiayi Han,kaiyue Zhou,Gaoang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Audio-driven bimanual piano, requires precise modeling, complex musical structures, bimanual piano motion, Audio-driven bimanual

Comments:

Click to view abstract

Abstract:Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9x compared to previous methods.

28. 【2604.12833】Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Link: https://arxiv.org/abs/2604.12833

Authors: Yingying Zhao,Chengyin Hu,Qike Zhang,Xin Li,Xin Wang,Yiwei Wei,Jiujiang Guo,Jiahuan Long,Tingsong Jiang,Wen Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains insufficiently understood, Vision-Language Models, security remains insufficiently, shown remarkable performance, insufficiently understood

Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.

29. 【2604.12832】Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

Link: https://arxiv.org/abs/2604.12832

Authors: Iman Islam,Bram Ruijsink,Andrew J. Reader,Andrew P. King

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Deep learning-based medical, learning-based medical image, medical image segmentation, image segmentation typically, segmentation typically relies

Comments: 5 pages, 3 figures, 2 tables, International Symposium on Biomedical Imaging 2026

Click to view abstract

Abstract:Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
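
The abstract names Variance of Gradients (VOG) as the signal used to flag suspect labels but gives no implementation details. The following is only a simplified sketch of the general idea: a sample whose gradients vary strongly across training checkpoints receives a high score and is flagged for review. The function names, the flat-list gradient format, and the `fraction` cut-off are all assumptions made for illustration, not the authors' pipeline.

```python
from statistics import fmean, pvariance


def vog_score(grads_per_checkpoint):
    """Simplified VOG-style score for one sample (illustrative only).

    grads_per_checkpoint: list of K gradient vectors (lists of floats),
    one per saved checkpoint, all of the same length.
    """
    # Group the values of each gradient element across the K checkpoints.
    per_element = zip(*grads_per_checkpoint)
    # Variance over checkpoints per element, then averaged over elements.
    return fmean(pvariance(values) for values in per_element)


def flag_suspects(scores, fraction):
    """Return the indices of the highest-scoring `fraction` of samples."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In a training loop, the flagged indices would be the candidates for the pseudo-label refurbishment step that the abstract describes.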

30. 【2604.12813】DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

Link: https://arxiv.org/abs/2604.12813

Authors: Xinyue Li,Shubo Xu,Zhichao Zhang,Zhaolin Cai,Yitong Chen,Guangtao Zhai

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Recent multimodal large, large language models, multimodal large language, Recent multimodal, shown promising performance

Comments:

Click to view abstract

Abstract:Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
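
The "frozen base estimate plus learned residual" idea in this abstract can be illustrated with a toy example. The sketch below is not DPC-VQA itself: it replaces the paper's calibration branch with a one-feature linear model fitted by least squares, and every name (`fit_residual`, `calibrated_score`, the scalar `feature`) is hypothetical. It only shows that the base scores stay fixed while a small model learns the gap to the target MOS.

```python
def fit_residual(features, base_scores, mos):
    """Fit residual = a * feature + b by least squares.

    The residual is the gap between the target MOS and the frozen base score.
    A linear model stands in here for the paper's calibration branch.
    """
    residuals = [m - b for m, b in zip(mos, base_scores)]
    n = len(features)
    mean_x = sum(features) / n
    mean_r = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(features, residuals))
    a = cov / var_x if var_x else 0.0
    return a, mean_r - a * mean_x


def calibrated_score(base, feature, params):
    """Final prediction: frozen base estimate plus the predicted residual."""
    a, b = params
    return base + a * feature + b
```

Because only `a` and `b` are fitted, the base estimator is never retrained, which mirrors the decoupling the abstract argues for.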

31. 【2604.12807】Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach

Link: https://arxiv.org/abs/2604.12807

Authors: Adrien Dorise,Marjorie Bellizzi,Omar Hlimi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: noise and blur, compensating for degradations, image restoration aims, Convolutional Board-ready Embedded, restoration

Comments: AI4SPACE@CVPR conference

Click to view abstract

Abstract:Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.
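
For readers less familiar with the metric, the reported PSNR gain can be related to mean squared error using the standard definition PSNR = 10·log10(MAX² / MSE). The helper below is a generic sketch over flat pixel lists, not the authors' evaluation code; the function names and the 8-bit default peak value are assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10 ** (gain_db / 10)
```

Under this definition, `mse_ratio_for_gain(6.9)` evaluates to roughly 4.9, so a +6.9 dB gain corresponds to cutting the mean squared error by about a factor of five.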

32. 【2604.12805】Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors

Link: https://arxiv.org/abs/2604.12805

Authors: Feiyu Tan,Heran Yang,Qihong Duan,Kai Ye,Qi Xie,Deyu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: adapting domain-specific attributes, preserving domain-invariant features, computer vision, focused on mapping, domain-specific attributes

Comments: 17 pages, 8 figures, submitting to TPAMI

Click to view abstract

Abstract:Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and the unsupervised learning framework still hinder their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to achieve a rotation-equivariant I2I framework, which is, to the best of our knowledge, a novel contribution along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study of image symmetry priors on real datasets and propose novel transformation-learnable equivariant convolutions (TL-Conv) that adaptively learn transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and providing a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and their broad applicability. Our code is available at this https URL.

33. 【2604.12803】Generative Anonymization in Event Streams

Link: https://arxiv.org/abs/2604.12803

Authors: Adam T. Müller,Mihai Kocsis,Nicolaj C. Stache

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: high dynamic range, sensors offer low, offer low latency, public spaces raises, spaces raises severe

Comments: Accepted to the 1st Workshop on Low-Level Vision Frontiers (LoViF) at IEEE/CVF CVPR 2026

Click to view abstract

Abstract:Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.

34. 【2604.12781】Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

Link: https://arxiv.org/abs/2604.12781

Authors: Haoyang Jiang,Mingyang Yi,Shaolei Zhang,Junxian Cai,Qingbin Liu,Xi Chen,Ju Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted increasing attention, increasing attention due, detecting AI-generated images, AI-generated images produced, detecting AI-generated

Comments:

Click to view abstract

Abstract:Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.

35. 【2604.12780】Efficient Adversarial Training via Criticality-Aware Fine-Tuning

Link: https://arxiv.org/abs/2604.12780

Authors: Wenyun Li,Zheng Zhang,Dongmei Jiang,Yaowei Wang,Xiangyuan Lan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved remarkable performance, Adversarial training, achieved remarkable, remarkable performance, key advantage

Comments:

Click to view abstract

Abstract:Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale. For example, compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.

36. 【2604.12777】Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

Link: https://arxiv.org/abs/2604.12777

Authors: Huanzhen Wang,Ziheng Zhou,Zeng Tao,Aoxing Li,Yingkai Zhao,Yuxuan Lin,Yan Wang,Wenqiang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: constructs emotional percepts, Dual-stream Semantic Enhancement, human brain constructs, Hierarchical Temporal Prompt, Temporal Prompt Cluster

Comments: Accepted by IEEE ICRA 2026

Click to view abstract

Abstract:The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.

37. 【2604.12772】A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

Link: https://arxiv.org/abs/2604.12772

Authors: Madeline Anderson,Mikhail Klassen,Ash Hoover,Kerri Cahoy

Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: multiple time steps, occur over multiple, time steps, satellite imagery, satellite

Comments:

Click to view abstract

Abstract:Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

38. 【2604.12767】CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.12767

Authors: Yunkai Dang,Yizhu Jiang,Yifan Jiang,Qi Fan,Yinghuan Shi,Wenbin Li,Yang Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at this https URL.
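
To make the "pivot tokens for relevance, completion tokens for coverage" split concrete, here is a small illustrative sketch. It is not the CLASP algorithm: the abstract does not specify the selection rules, so the sketch assumes a simple top-k attention pick for pivots and a greedy least-similar pick (by cosine similarity) for completion tokens. All names and the fixed budget split are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_stage_prune(features, attention, n_pivot, n_completion):
    """Return the sorted indices of the visual tokens to keep.

    Stage 1 keeps the n_pivot tokens with the highest attention score.
    Stage 2 greedily adds up to n_completion tokens that are least similar
    to the tokens already kept, so the kept set covers different content.
    """
    order = sorted(range(len(features)), key=lambda i: attention[i], reverse=True)
    kept = order[:n_pivot]
    remaining = order[n_pivot:]
    for _ in range(min(n_completion, len(remaining))):
        best = min(
            remaining,
            key=lambda i: max(
                (cosine(features[i], features[j]) for j in kept), default=0.0
            ),
        )
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)
```

With a budget of one pivot and one completion token, a near-duplicate of the pivot is skipped in favour of a token pointing in a different direction, which is the redundancy-aware behaviour the abstract alludes to.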

39. 【2604.12765】A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture

Link: https://arxiv.org/abs/2604.12765

Authors: Yeeun Park,Miqdad Naduthodi,Suryansh Kumar

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: markers limits scalability, Marker-based motion capture, human motion capture, Marker-based motion, motion capture

Comments: 14 pages, 11 figures, 4 tables. Accepted for publication at CVPR 2026 4D World Models Workshop

Click to view abstract

Abstract:Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.

40. 【2604.12762】ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Link: https://arxiv.org/abs/2604.12762

Authors: Myungchul Kim,Kwanyong Park,Junmo Kim,In So Kweon

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: reformulates multi-camera person, multi-camera person search, interactive reasoning problem, reasoning problem requiring, ARGOS agent receives

Comments: Accepted to CVPR 2026 Workshop on Multimodal Spatial Intelligence (MUSI)

Click to view abstract

Abstract:We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.

41. 【2604.12752】Scaling In-Context Segmentation with Hierarchical Supervision

Link: https://arxiv.org/abs/2604.12752

Authors: T. Camaret Ndir,Marco Reisert,Robin T. Schirrmeister

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinical annotation burden, In-context learning, enables medical image, enables medical, annotation burden

Comments:

Click to view abstract

Abstract:In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at 512×512 resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at this https URL.

42. 【2604.12735】AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

Link: https://arxiv.org/abs/2604.12735

Authors: Zeheng Wang,Zitong Yu,Yijie Zhu,Bo Zhao,Haochen Liang,Taorui Wang,Wei Xia,Jiayu Zhang,Zhishu Liu,Hui Ma,Fei Ma,Qi Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: static parametric memory, nuanced affective states, LLM-based multimodal emotion, interpreting nuanced affective, multimodal emotion recognition

Comments:

Click to view abstract

Abstract:LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: this https URL.

43. 【2604.12709】Information-Theoretic Optimization for Task-Adapted Compressed Sensing Magnetic Resonance Imaging

Link: https://arxiv.org/abs/2604.12709

Authors: Xinyu Peng,Ziyang Zheng,Wenrui Dai,Duoduo Xue,Shaohui Li,Chenglin Li,Junni Zou,Hongkai Xiong

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, compressed sensing magnetic, sensing magnetic resonance, Task-adapted compressed sensing, required by Nyquist

Comments: 68 pages, 15 figures, accepted by IEEE TPAMI

Click to view abstract

Abstract:Task-adapted compressed sensing magnetic resonance imaging (CS-MRI) is emerging to address the specific demands of downstream clinical tasks with significantly fewer k-space measurements than required by Nyquist sampling. However, existing task-adapted CS-MRI methods suffer from the uncertainty problem for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks. To address these limitations, we propose the first task-adapted CS-MRI from the information-theoretic perspective to simultaneously achieve probabilistic inference for uncertainty prediction and adapt to arbitrary sampling ratios and versatile clinical applications. Specifically, we formalize the task-adapted CS-MRI optimization problem by maximizing the mutual information between undersampled k-space measurements and clinical tasks to enable probabilistic inference for addressing the uncertainty problem. We leverage amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models, which enables flexible sampling ratio control using a single end-to-end trained model. Furthermore, the proposed framework addresses two distinct kinds of clinical scenarios within a unified approach, i.e., i) joint task and reconstruction, where reconstruction serves as an auxiliary process to enhance task performance; and ii) task implementation with suppressed reconstruction, applicable for privacy protection. Extensive experiments on large-scale MRI datasets demonstrate that the proposed framework achieves highly competitive performance on standard metrics like Dice compared with its deterministic counterpart, while providing better distribution matching to the ground-truth posterior distribution as measured by the generalized energy distance (GED).
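
The abstract does not state which variational bounds the authors construct. As background only, a standard lower bound of this kind (often attributed to Barber and Agakov) is shown below. The symbols are assumptions introduced here: $y$ denotes the clinical task target, $z$ the undersampled k-space measurements, and $q(y \mid z)$ any tractable approximate posterior, such as a task-inference network.

```latex
\begin{aligned}
I(y; z) &= H(y) - H(y \mid z) \\
        &= H(y) + \mathbb{E}_{p(y,z)}\big[\log p(y \mid z)\big] \\
        &\ge H(y) + \mathbb{E}_{p(y,z)}\big[\log q(y \mid z)\big]
\end{aligned}
```

The inequality holds because the gap equals $\mathbb{E}_{p(z)}\big[\mathrm{KL}\big(p(y \mid z)\,\|\,q(y \mid z)\big)\big] \ge 0$. Since $H(y)$ does not depend on the model, maximizing the bound reduces to maximizing the expected log-likelihood of the task target under $q$, which is what makes joint end-to-end training tractable in approaches of this type.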

44. 【2604.12693】Risk-Calibrated Learning: Minimizing Fatal Errors in Medical AI

Link: https://arxiv.org/abs/2604.12693

Authors: Abolfazl Mohammadi-Seif,Ricardo Baeza-Yates

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning models, achieve expert-level accuracy, medical image classification, Deep learning, semantic incoherence

Comments: This work has been accepted for publication in the Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN 2026). The final published version should be cited

Click to view abstract

Abstract:Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. These high-confidence mistakes that are semantically incoherent (e.g., classifying a malignant tumor as benign) fundamentally differ from acceptable errors which stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach in four different imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) for all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
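
One simple way to fold a severity matrix into a classification loss is to add the expected severity of the predicted distribution to the usual cross-entropy term. The sketch below illustrates that general idea only; the abstract does not give the exact form of the Risk-Calibrated Loss, so the additive combination, the `weight` parameter, and the example matrix are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def risk_weighted_loss(logits, target, severity, weight=1.0):
    """Cross-entropy plus the expected severity of the predicted distribution.

    severity[t][j] is the assumed clinical cost of predicting class j
    when the true class is t (zero on the diagonal).
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    expected_severity = sum(severity[target][j] * p for j, p in enumerate(probs))
    return cross_entropy + weight * expected_severity
```

With a matrix such as `[[0, 1], [5, 0]]` (class 0 benign, class 1 malignant), probability mass placed on "benign" for a malignant case is penalized five times more heavily than the reverse mistake, which captures the asymmetry between fatal and tolerable errors described in the abstract.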

45. 【2604.12683】Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

Link: https://arxiv.org/abs/2604.12683

Authors: Junfeng Xia,Wenhao Ye,Xuanye Pan,Xinke Shen,Mo Wang,Quanying Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

Keywords: Current fMRI foundation, diverse brain states, fMRI foundation models, Current fMRI, foundation models primarily

Comments:

Click to view abstract

Abstract:Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at this https URL.

46. 【2604.12668】OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

Link: https://arxiv.org/abs/2604.12668

Authors: Haoyang Jiang, Zekun Wang, Mingyang Yi, Xiuyu Li, Lanqing Hu, Junxian Cai, Qingbin Liu, Xi Chen, Ju Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Probabilistic Model, Diffusion Probabilistic, Probabilistic Model, computational overhead hinder, achieves remarkable performance

Comments:

Abstract:The Diffusion Probabilistic Model (DPM) achieves remarkable performance in image generation, but its increasing parameter size and computational overhead hinder its deployment in practical applications. To address this, the existing literature focuses on obtaining a smaller model with a fixed architecture through model compression. However, in practice, DPMs usually need to be deployed on various devices with different resource constraints, which requires multiple compression processes and incurs significant overhead from repeated training. To avoid this, we propose a once-for-all (OFA) compression framework for DPMs that yields subnetworks with different computational costs from a single training run. Existing OFA frameworks typically involve a massive number of subnetworks with different parameter sizes, and such a huge candidate space slows the optimization. Thus, we propose to restrict the candidate subnetworks to a fixed set of parameter sizes, where each size corresponds to a specific subnetwork. Specifically, to construct each subnetwork of a given size, we gradually allocate the retained channels according to their importance. Furthermore, we propose a reweighting strategy to balance the optimization of the different subnetworks. Experimental results show that our approach can produce compressed DPMs of various sizes with significantly lower training overhead while achieving satisfactory performance.

47. 【2604.12665】Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

Link: https://arxiv.org/abs/2604.12665

Authors: Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang, Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables consistent association, cornerstone of multi-object, consistent association, Motion reasoning serves, Motion

Comments:

Abstract:Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when the target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

48. 【2604.12652】PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Link: https://arxiv.org/abs/2604.12652

Authors: Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang, Pipei Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: CLIP Score, signals remains challenging, costly human-annotated preference, human-annotated preference data, Reinforcement learning

Comments:

Abstract:Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
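The reward described here is built from the token-level cross-entropy of the prompt under a frozen VLM. A minimal sketch of that final step is shown below, assuming the per-token log-probabilities of the prompt tokens (conditioned on the image and query) have already been obtained from the VLM. The function name `echo_style_reward`, the mean normalization, and the sign convention are assumptions; the abstract does not specify them.

```python
import math


def echo_style_reward(prompt_token_logprobs):
    """Turn prompt-token log-probabilities into a scalar reward.

    prompt_token_logprobs: log p(token | image, query, previous tokens) for
    each token of the original prompt, as scored by a frozen VLM. Producing
    these values requires a VLM forward pass, which is not shown here.
    """
    if not prompt_token_logprobs:
        raise ValueError("need at least one prompt token")
    # Mean token-level cross-entropy with the prompt as the label.
    cross_entropy = -sum(prompt_token_logprobs) / len(prompt_token_logprobs)
    # Lower cross-entropy means the image "echoes" the prompt better,
    # so the reward is taken as its negative (an assumed convention).
    return -cross_entropy
```

Because the VLM is frozen and the score is a plain average, the same image and prompt always yield the same reward, which matches the abstract's description of the reward as deterministic.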

49. 【2604.12650】Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Link: https://arxiv.org/abs/2604.12650

Authors: Miao Liu, Fangda Wei, Jing Wang, Xinyuan Qian

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: generating fabricated content, deepfake detection research, deepfake detection, Speaking Deepfake Detection, existing Speaking Deepfake

Comments: Submitted to ACMMM 2026

Abstract:Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at this https URL.

50. 【2604.12630】GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

Link: https://arxiv.org/abs/2604.12630

Authors: Zhaochen Liu, Limeng Qiao, Guanglu Wan, Tingting Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal large language, Multimodal large, large language models, exhibited remarkable performance, large language

Comments:

51. 【2604.12626】Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

Link: https://arxiv.org/abs/2604.12626

Authors: Ziyuan Xia, Jingyi Xu, Chong Cui, Yuanhong Yu, Jiazhao Zhang, Qingsong Yan, Tao Ni, Junbo Chen, Xiaowei Zhou, Hujun Bao, Ruizhen Hu, Sida Peng

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: model dynamic humans, agents depends critically, depends critically, fidelity of simulation, simulation environments

Comments: Project page: [this https URL](https://zju3dv.github.io/habitat-gs/)

Abstract:Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable Gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a Gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the most effective strategy. Evaluations on avatar-aware navigation further confirm that Gaussian avatars enable effective human-aware navigation. Finally, performance benchmarks validate the system's scalability across varying scene complexity and avatar counts.

52. 【2604.12622】Efficient Semantic Image Communication for Traffic Monitoring at the Edge

Link: https://arxiv.org/abs/2604.12622

Authors: Damir Assylbek, Nurmukhammed Aitymbetov, Marko Ristin, Dimitrios Zorbas

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Keywords: strict communication constraints, monitoring systems operate, transmitting full-resolution images, visual monitoring systems, systems operate

Comments:

Abstract:Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15 s for MMSD and 9 s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.

53. 【2604.12600】Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising

Link: https://arxiv.org/abs/2604.12600

Authors: Xuelin Xie, Xiliang Lu, Zhengshan Wang, Yang Zhang, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

Keywords: noise prior modeling, hyperspectral image denoising, core challenge, challenge of hyperspectral, prior modeling

Comments:

Abstract:The core challenge of hyperspectral image denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.

54. 【2604.12592】ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction

Link: https://arxiv.org/abs/2604.12592

Authors: Yuhao Liu, Dingju Wang, Ziyang Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optimized Gaussian Splatting, degraded multi-view inputs, photorealistic Gaussian Splatting, Gaussian Splatting, Low-light Optimized Gaussian

Comments: Our method achieved a ranking of 9 out of 148 participants in Track 1 of the NTIRE 3DRR Challenge, as reported on the official competition website: [this https URL](https://www.codabench.org/competitions/13854/)

Abstract:This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at this https URL.

55. 【2604.12582】Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

Link: https://arxiv.org/abs/2604.12582

Authors: Zijian Liu, Sihan Cao, Pengcheng Zheng, Kuien Liu, Caiyan Qin, Xiaolin Qin, Jiwei Wei, Chaoning Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Recent Video Large, Video Large Language, Large Language, demonstrated strong capability

Comments:

Abstract:Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
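As a toy illustration of the two ideas in this abstract, the sketch below picks the anchor frame as the one with the highest aggregated frame-level attention mass, then softens the distribution by blending it with a uniform one. The blending rule and the `alpha` parameter are assumptions chosen for simplicity; the abstract does not describe DTR's actual calibration procedure, and the helper names are hypothetical.

```python
def anchor_frame(frame_mass):
    """Index of the frame holding the largest aggregated attention mass."""
    return max(range(len(frame_mass)), key=frame_mass.__getitem__)


def rebalance(frame_mass, alpha=0.5):
    """Blend normalized per-frame attention mass with a uniform distribution.

    alpha = 0 leaves the distribution unchanged; alpha = 1 makes it uniform.
    This is an illustrative stand-in, not the calibration used by DTR.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total = sum(frame_mass)
    n_frames = len(frame_mass)
    normalized = [m / total for m in frame_mass]
    return [(1.0 - alpha) * p + alpha / n_frames for p in normalized]
```

For example, with per-frame mass `[0.7, 0.1, 0.1, 0.1]` the anchor is frame 0, and `rebalance(..., alpha=0.5)` lowers its share while raising the share of the under-attended frames, keeping the total at 1.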

56. 【2604.12580】PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting

Link: https://arxiv.org/abs/2604.12580

Authors: Kangmin Seo, MinKyu Lee, Tae-Young Kim, ByeongCheol Lee, JoonSeoung An, Jae-Pil Heo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time photorealistic rendering, enabled impressive real-time, impressive real-time photorealistic, Recent advances, Gaussian Splatting

Comments: Accepted to CVPR Findings 2026

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled impressive real-time photorealistic rendering. However, conventional training pipelines inherently assume full multi-view consistency among input images, which makes them sensitive to distractors that violate this assumption and cause visual artifacts. In this work, we revisit an underexplored aspect of 3DGS: its inherent ability to suppress inconsistent signals. Building on this insight, we propose PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), a framework that amplifies this self-filtering property through a progressive multi-phase optimization. The progressive filtering phases gradually remove distractors by exploiting discrepancy cues, while the following reconstruction phase restores fine-grained, view-consistent details from the purified Gaussian representation. Through this iterative refinement, PDF-GS achieves robust, high-fidelity, and distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Moreover, our approach is lightweight and easily adaptable to existing 3DGS frameworks, requiring no architectural changes or additional inference overhead, leading to a new state-of-the-art performance. The code is publicly available at this https URL.

57. 【2604.12575】StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

Link: https://arxiv.org/abs/2604.12575

Authors: Yinxi He, Kang Liao, Chunyu Lin, Tianyi Wei, Yao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generative framework based, single-scale diffusion model, generative framework, single-scale diffusion, single-image generation

Comments: Accepted by IEEE Transactions on Multimedia (Regular Paper)

Abstract:This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at this https URL.

58. 【2604.12574】Cross-Modal Knowledge Distillation for PET-Free Amyloid-Beta Detection from MRI

Link: https://arxiv.org/abs/2604.12574

Authors: Francesco Chiumento, Julia Dietlmeier, Ronan P. Killeen, Kathleen M. Curran, Noel E. O'Connor, Mingming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires PET imaging, typically requires PET, Detecting amyloid, diagnosis of Alzheimer, Alzheimer disease

Comments: Accepted to CVPR Workshops 2026 (PHAROS-AIF-MIH)

Abstract:Detecting amyloid-β (Aβ) positivity is crucial for early diagnosis of Alzheimer's disease but typically requires PET imaging, which is costly, invasive, and not widely accessible, limiting its use for population-level screening. We address this gap by proposing a PET-guided knowledge distillation framework that enables Aβ prediction from MRI alone, without requiring non-imaging clinical covariates or PET at inference. Our approach employs a BiomedCLIP-based teacher model that learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling. An MRI-only student then mimics the teacher via feature-level and logit-level distillation. Evaluated across four MRI contrasts (T1w, T2w, FLAIR, T2*) and two independent datasets, our approach demonstrates effective knowledge transfer (best AUC: 0.74 on OASIS-3, 0.68 on ADNI) while maintaining interpretability and eliminating the need for clinical variables. Saliency analysis confirms that predictions focus on anatomically relevant cortical regions, supporting the clinical viability of PET-free Aβ screening. Code is available at this https URL.
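The combination of feature-level and logit-level distillation mentioned here can be sketched with the textbook formulation: a temperature-softened KL divergence between teacher and student logits, plus a mean-squared error between their feature vectors. The temperature, the weight `beta`, and the function names are assumptions; the abstract does not give the paper's exact loss terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def feature_distillation(student_feat, teacher_feat):
    """Mean-squared error between student and teacher feature vectors."""
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      temperature=2.0, beta=1.0):
    """Logit-level term plus beta times the feature-level term."""
    return (logit_distillation(student_logits, teacher_logits, temperature)
            + beta * feature_distillation(student_feat, teacher_feat))
```

In the setting described above, the teacher's logits and features would come from the PET-informed model and the student's from the MRI-only model, so that PET is needed only during training.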

59. 【2604.12568】Evolution-Inspired Sample Competition for Deep Neural Network Optimization

Link: https://arxiv.org/abs/2604.12568

Authors: Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training generally optimizes, deep network training, uniform learning paradigm, network training generally, largely uniform learning

Comments:

Abstract:Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method. Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.

60. 【2604.12565】Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Link: https://arxiv.org/abs/2604.12565

Authors: Yida Niu, Xinhai Chang, Xin Liu, Ziyuan Jiao, Yixin Zhu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: coordinate whole-body motion, Robots deployed, physical world, whole-body motion, simultaneously moving

Comments:

Abstract:Robots deployed in unstructured environments must coordinate whole-body motion -- simultaneously moving a mobile base and arm -- to interact with the physical world. This coupled mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods, including teleoperation and planning, are either labor-intensive or computationally prohibitive at scale. The core bottleneck is the lack of a scalable pipeline for generating large-scale, physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that unifies AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa achieves 5,000 episodes per GPU-hour (over 80× faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream IL policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for SOTA methods to reach ≈80% success, confirming that data scarcity -- not algorithmic limitations -- has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa showcases generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.

61. 【2604.12551】Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Link: https://arxiv.org/abs/2604.12551

Authors: Tomas Berriel Martins, Martin R. Oswald, Javier Civera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: development of open-vocabulary, semantic segmentation, Vision-language models, Abstract, models

Comments:

Abstract:Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.

62. 【2604.12537】MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Link: https://arxiv.org/abs/2604.12537

Authors: Ruoxiang Huang, Zhen Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: mechanisms remain suboptimal, achieved remarkable progress, encoding mechanisms remain, remain suboptimal, achieved remarkable

Comments: Accepted by CVPR 2026 (Highlight). 10 pages, 2 figures, 5 tables

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.

63. 【2604.12525】CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

Link: https://arxiv.org/abs/2604.12525

Authors: Zhaoyang Jia, Naifu Xue, Zihan Zheng, Jiahao Li, Bin Li, Xiaoyi Zhang, Zongyu Guo, Yuan Zhang, Houqiang Li, Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advanced diffusion, methods typically derive, typically derive strong, Recent advanced, advanced diffusion methods

Comments:

Abstract:Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Code is released at this https URL

64. 【2604.12512】NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)

链接：https://arxiv.org/abs/2604.12512

作者：Guanyi Qin,Jie Liang,Bingbing Zhang,Lishen Qu,Ya-nan Guan,Hui Zeng,Lei Zhang,Radu Timofte,Jianhui Sun,Xinli Yue,Tao Shao,Huan Hou,Wenjie Liao,Shuhao Han,Jieyu Yuan,Chunle Guo,Chongyi Li,Zewen Chen,Yunze Liu,Jian Guo,Juan Wang,Yun Zeng,Bing Li,Weiming Hu,Hesong Li,Dehua Liu,Xinjie Zhang,Qiang Li,Li Yan,Wei Dong,Qingsen Yan,Xingcan Li,Shenglong Zhou,Manjiang Yin,Yinxiang Zhang,Hongbo Wang,Jikai Xu,Zhaohui Fan,Dandan Zhu,Wei Sun,Weixia Zhang,Kun Zhu,Nana Zhang,Kaiwei Zhang,Qianqian Zhang,Zhihan Zhang,William Gordon,Linwei Wu,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Cici Liu,Yaokun Shi

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Image Quality Assessment, Quality Assessment, focusing on Track, Professional Image Quality, Conventional Image Quality

备注： NTIRE Challenge Report. Accepted by CVPRW 2026

点击查看摘要

Abstract:In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at this https URL, and the official homepage is accessible at this https URL.

65. 【2604.12509】Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

链接：https://arxiv.org/abs/2604.12509

作者：Snehal Jauhri,Vignesh Prasad,Georgia Chalvatzaki

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：base and arms, Classical whole-body controllers, opening doors, demands simultaneous, articulated objects

备注： PrePrint. Project website: [this http URL](http://sites.google.com/view/whole-moma)

点击查看摘要

Abstract:Mobile Manipulation (MoMa) of articulated objects, such as opening doors, drawers, and cupboards, demands simultaneous, whole-body coordination between a robot's base and arms. Classical whole-body controllers (WBCs) can solve such problems via hierarchical optimization, but require extensive hand-tuned optimization and remain brittle. Learning-based methods, on the other hand, show strong generalization capabilities but typically rely on expensive whole-body teleoperation data or heavy reward engineering. We observe that even a sub-optimal WBC is a powerful structural prior: it can be used to collect data in a constrained, task-relevant region of the state-action space, and its behavior can still be improved upon using offline reinforcement learning. Building on this, we propose WHOLE-MoMa, a two-stage pipeline that first generates diverse demonstrations by randomizing a lightweight WBC, and then applies offline RL to identify and stitch together improved behaviors via a reward signal. To support the expressive action-chunked diffusion policies needed for complex coordination tasks, we extend offline implicit Q-learning with Q-chunking for chunk-level critic evaluation and advantage-weighted policy extraction. On three tasks of increasing difficulty using a TIAGo++ mobile manipulator in simulation, WHOLE-MoMa significantly outperforms WBC, behavior cloning, and several offline RL baselines. Policies transfer directly to the real robot without finetuning, achieving 80% success in bimanual drawer manipulation and 68% in simultaneous cupboard opening and object placement, all without any teleoperated or real-world training data.
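For readers unfamiliar with implicit Q-learning, the two standard per-sample quantities it relies on are the expectile regression loss for the value function and the exponentiated-advantage weight used in policy extraction. The sketch below shows only these textbook forms; it does not reproduce the chunk-level critic or the diffusion policy described in the abstract, and the default values of `tau`, `beta`, and `max_weight` are illustrative assumptions.

```python
import math

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2, where diff = Q(s, a) - V(s)."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff

def advantage_weight(q, v, beta=3.0, max_weight=100.0):
    """Weight exp(beta * (Q - V)) for advantage-weighted policy extraction, clipped for stability."""
    return min(math.exp(beta * (q - v)), max_weight)
```

With `tau` above 0.5, positive differences are penalized more than negative ones, which pushes the value estimate toward the upper range of Q-values seen in the data.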

66. 【2604.12508】From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

链接：https://arxiv.org/abs/2604.12508

作者：Jilong Zhu,Yang Feng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated impressive capabilities

备注：

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.

67. 【2604.12502】SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

链接：https://arxiv.org/abs/2604.12502

作者：Junbin Su,Ziteng Xue,Shihui Zhang,Kun Chen,Weiming Hu,Zhipeng Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：fundamentally erodes PEFT, inflated parameter budgets, PEFT efficiency promise, erodes PEFT efficiency, Parameter-efficient fine-tuning

备注： Accepted as a CVPR 2026 Oral

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack achieves notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at this https URL.

68. 【2604.12481】T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

链接：https://arxiv.org/abs/2604.12481

作者：Nihal Jaiswal,Siddhartha Arjaria,Gyanendra Chaubey,Ankush Kumar,Aditya Singh,Anchal Chaurasiya

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieve impressive visual, impressive visual fidelity, amplify demographic imbalances, cultural biases embedded, models achieve impressive

备注：

点击查看摘要

Abstract:Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: this https URL

69. 【2604.12463】Euler-inspired Decoupling Neural Operator for Efficient Pansharpening

链接：https://arxiv.org/abs/2604.12463

作者：Anqi Zhu,Mengting Ma,Yizhen Jiang,Xiangdong Li,Kai Zheng,Jiaxin Li,Wei Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：synthesize high-resolution multispectral, Feature Interaction Module, high-resolution multispectral, low-resolution multispectral, Decoupling Neural Operator

备注：

点击查看摘要

Abstract:Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler's formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on the three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.
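The polar view that the abstract attributes to Euler's formula rests on the identity a + ib = r·e^{iθ}: multiplying by e^{iφ} changes only the phase and leaves the amplitude untouched. The snippet below shows that identity on a single complex coefficient. It is not the EFIL layer; `rotate_phase` is a hypothetical helper, and the paper's learned linear weighting is not modeled.

```python
import cmath
import math

def rotate_phase(z, phi):
    """Rotate a complex coefficient by phi radians while keeping its amplitude."""
    r, theta = cmath.polar(z)           # amplitude and phase of z
    return cmath.rect(r, theta + phi)   # back to Cartesian form

# Example: a quarter-turn maps 1 + 0j to (approximately) 0 + 1j.
quarter_turn = rotate_phase(1 + 0j, math.pi / 2)
```

Because amplitude and phase are handled separately in this form, an operation such as alignment can act on the phase alone, which is the decoupling the abstract refers to.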

70. 【2604.12446】Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

链接：https://arxiv.org/abs/2604.12446

作者：Zida Li,Jun Li,Yuzhe Sha,Ziqiang Li,Lizhi Xiong,Zhangjie Fu

类目：Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词：achieved remarkable success, backdoor security risks, open ecosystems introduces, image synthesis, security risks

备注： Under Review

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models have achieved remarkable success in image synthesis, but their reliance on large-scale data and open ecosystems introduces serious backdoor security risks. Existing defenses, particularly input-level methods, are more practical for deployment but often rely on observable anomalies that become unreliable under stealthy, semantics-preserving trigger designs. As modern backdoor attacks increasingly embed triggers into natural inputs, these methods degrade substantially, raising a critical question: can more stable, implicit, and trigger-agnostic differences between benign and backdoor inputs be exploited for detection? In this work, we address this challenge from an active probing perspective. We introduce controlled scaling perturbations on cross-attention and uncover a novel phenomenon termed Cross-Attention Scaling Response Divergence (CSRD), where benign and backdoor inputs exhibit systematically different response evolution patterns across denoising steps. Building on this insight, we propose SET, an input-level backdoor detection framework that constructs response-offset features under multi-scale perturbations and learns a compact benign response space from a small set of clean samples. Detection is then performed by measuring deviations from this learned space, without requiring prior knowledge of the attack or access to model training. Extensive experiments demonstrate that SET consistently outperforms existing baselines across diverse attack methods, trigger types, and model settings, with particularly strong gains under stealthy implicit-trigger scenarios. Overall, SET improves AUROC by 9.1% and ACC by 6.5% over the best baseline, highlighting its effectiveness and robustness for practical deployment.
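The final detection step, scoring an input by how far its features fall from a space fitted on clean samples, can be sketched in a few lines. This is an assumed stand-in rather than SET itself: a per-dimension mean and standard deviation replace the paper's learned benign response space, the feature vectors are taken as given, and both function names are hypothetical.

```python
import math

def fit_benign(features):
    """Fit per-dimension mean and standard deviation on clean feature vectors."""
    n, d = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(d)]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in features) / n) + 1e-6
            for j in range(d)]
    return means, stds

def deviation_score(x, means, stds):
    """Standardized Euclidean distance from the benign mean; larger means more anomalous."""
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, means, stds)))
```

In use, a threshold on `deviation_score` would be chosen on held-out clean samples; the fit needs no backdoor examples, mirroring the abstract's point that no prior knowledge of the attack is required.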

71. 【2604.12443】DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization

链接：https://arxiv.org/abs/2604.12443

作者：Paschalis Giakoumoglou,Symeon Papadopoulos

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Modern diffusion-based inpainting, pose significant challenges, full regeneration pipelines, regeneration pipelines reconstruct, camera-level noise patterns

备注： CVPRW2026

点击查看摘要

Abstract:Modern diffusion-based inpainting models pose significant challenges for image forgery localization (IFL), as their full regeneration pipelines reconstruct the entire image via a latent decoder, disrupting the camera-level noise patterns that existing forensic methods rely on. We propose DiffusionPrint, a patch-level contrastive learning framework that learns a forensic signal robust to the spectral distortions introduced by latent decoding. It exploits the fact that inpainted regions generated by the same model share a consistent generative fingerprint, using this as a self-supervisory signal. DiffusionPrint trains a convolutional backbone via a MoCo-style objective with cross-category hard negative mining and a generator-aware classification head, producing a forensic feature map that serves as a highly discriminative secondary modality in fusion-based IFL frameworks. Integrated into TruFor, MMFusion, and a lightweight fusion baseline, DiffusionPrint consistently improves localization across multiple generative models, with gains of up to +28% on mask types unseen during fine-tuning and confirmed generalization to unseen generative architectures. Code is available at this https URL

72. 【2604.12440】IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

链接：https://arxiv.org/abs/2604.12440

作者：Haoyu Zheng,Tianwei Lin,Wei Wang,Zhuonan Wang,Wenqiao Zhang,Jiaqi Zhu,Feifei Shao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Real-world industrial inspection, industrial inspection requires, Real-world industrial, industrial inspection, inspection requires

备注：

点击查看摘要

Abstract:Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by 76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.

73. 【2604.12437】A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

链接：https://arxiv.org/abs/2604.12437

作者：Mohammed Asad,Mohit Bajpai,Sudhir Singh,Rahul Katarya

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：suspicious breast lesions, Convolutional Neural Networks, Accurate characterization, treatment planning, characterization of suspicious

备注： 4 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.

74. 【2604.12424】Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

链接：https://arxiv.org/abs/2604.12424

作者：Sihang Jia,Shuliang Liu,Songbo Yang,Yibo Yan,Xin Zou,Xuming Hu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Large Language Models, Multimodal Large Language, Models frequently suffer, Language Models frequently, Multimodal Large

备注：

点击查看摘要

75. 【2604.12411】DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

链接：https://arxiv.org/abs/2604.12411

作者：Qiuyu Tian,Haoliang Sun,Yunshan Wang,Yinghuan Shi,Yilong Yin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：deep neural networks, neural networks demonstrate, networks demonstrate strong, demonstrate strong generalization, Segmentation models based

备注： 27 pages, 6 figures

点击查看摘要

Abstract:Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human-AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentors for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures.

76. 【2604.12403】Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

链接：https://arxiv.org/abs/2604.12403

作者：Jungwon Choi,Eunwoo Kim

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：adapts vision-language models, Test-Time Prompt Tuning, Prompt Tuning, adapts vision-language, effectiveness is hindered

备注： Accepted by CVPR 2026 findings

点击查看摘要

Abstract:Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
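The standard entropy-based filtering that the abstract criticizes is simple to state: compute the prediction entropy of each augmented view and keep only the most confident fraction. The sketch below shows that baseline, not the proposed anchor-guided method; the function names and the default `keep_ratio` are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_confident_views(view_probs, keep_ratio=0.1):
    """Return indices of the lowest-entropy views, keeping at least one."""
    k = max(1, int(len(view_probs) * keep_ratio))
    ranked = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return ranked[:k]
```

The weakness is visible in the code: the ranking depends only on how peaked each prediction is, so a confidently wrong view of a background crop is kept as readily as a confidently right one.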

77. 【2604.12391】Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

链接：https://arxiv.org/abs/2604.12391

作者：Jiawei Fan,Shigeng Wang,Chao Li,Xiaolong Liu,Anbang Yao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：vision foundation models, performance-lossless training acceleration, model, training acceleration method, vision foundation

备注： This work is accepted to CVPR 2026. Code is available at [this https URL](https://github.com/deep-optimization/CoM-PT)

点击查看摘要

Abstract:In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

78. 【2604.12380】Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

链接：https://arxiv.org/abs/2604.12380

作者：Hao Wang,Jiqing Zhang,Xin Yang,Baocai Yin,Lu Jiang,Zetian Mi,Huibing Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：exploiting additional visual, Camouflaged Object Detection, additional visual modalities, Object Detection, complex backgrounds

备注： 10

点击查看摘要

Abstract:Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

79. 【2604.12371】Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

链接：https://arxiv.org/abs/2604.12371

作者：Ravikumar Balakrishnan,Sanket Mendapara,Ankit Garg

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：camera-equipped embodied agents, bypass safety mechanisms, study typographic prompt, typographic prompt injection, autonomous agents

备注： Accepted at ICLR 2026 Workshop on Agents in the Wild

点击查看摘要

Abstract:We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6-28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10-12% and reduce ASR by 34-96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.

80. 【2604.12358】Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Link: https://arxiv.org/abs/2604.12358

Authors: Jiwan Kim,Kibum Kim,Wonjoong Kim,Byung-Kwan Lee,Chanyoung Park

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

Comments: Preprint, Project: [this https URL](https://ptkjw1997.github.io/DSTP-page/)

Click to view abstract

Abstract:Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.

81. 【2604.12357】ReflectCAP: Detailed Image Captioning with Reflective Memory

Link: https://arxiv.org/abs/2604.12357

Authors: Kyungmin Min,Minbeom Kim,Kang-il Lee,Seunghyun Yoon,Kyomin Jung

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: image captioning demands, Structured Reflection Notes, Detailed image captioning, called Structured Reflection, achieve them simultaneously

Comments:

Click to view abstract

Abstract:Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21--36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.

82. 【2604.12356】OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Link: https://arxiv.org/abs/2604.12356

Authors: Dongjian Yu,Weiqing Min,Qian Jiang,Xing Lin,Xin Jin,Shuqiang Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: personalized diet management, promoting healthy dietary, healthy dietary habits, diet management, food nutrition plays

Comments: Accepted by CVPR 2026 (Highlight Paper)

Click to view abstract

Abstract:Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models' capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: this https URL

83. 【2604.12353】Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

Link: https://arxiv.org/abs/2604.12353

Authors: Haifeng Zhang,Qinghui He,Xiuli Bi,Bo Liu,Chi-Man Pun,Bin Xiao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: artificial intelligence technology, creating high-quality fake, generative artificial intelligence, high-quality fake images, recent years

Comments:

Click to view abstract

Abstract:In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.

84. 【2604.12351】Fundus Image-based Glaucoma Screening via Retinal Knowledge-Oriented Dynamic Multi-Level Feature Integration

Link: https://arxiv.org/abs/2604.12351

Authors: Yuzhuo Zhou,Chi Liu,Sheng Shen,Zongyuan Ge,Fengshi Jing,Shiran Zhang,Yu Jiang,Anli Wang,Wenjian Liu,Feilong Yang,Tianqing Zhu,Xiaotong Han

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: color fundus photography, based on color, photography is essential, Automated diagnosis based, glaucoma screening

Comments: 15 pages. In submission to an Elsevier Journal

Click to view abstract

Abstract:Automated diagnosis based on color fundus photography is essential for large-scale glaucoma screening. However, existing deep learning models are typically data-driven and lack explicit integration of retinal anatomical knowledge, which limits their robustness across heterogeneous clinical datasets. Moreover, pathological cues in fundus images may appear beyond predefined anatomical regions, making fixed-region feature extraction insufficient for reliable diagnosis. To address these challenges, we propose a retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors. The framework adopts a tri-branch structure to capture complementary retinal representations, including global retinal context, structural features of the optic disc/cup, and dynamically localized pathological regions. A Dynamic Window Mechanism is devised to adaptively identify diagnostically informative regions, while a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model to guide attention learning. Extensive experiments on the large-scale AIROGS dataset demonstrate that the proposed method outperforms diverse baselines, achieving an AUC of 98.5% and an accuracy of 94.6%. Additional evaluations on multiple datasets from the SMDG-19 benchmark further confirm its strong cross-domain generalization capability, indicating that knowledge-guided attention combined with adaptive lesion localization can significantly improve the robustness of automated glaucoma screening systems.

85. 【2604.12346】Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

Link: https://arxiv.org/abs/2604.12346

Authors: Zanyi Wang,Fan Li,Dengyang Jiang,Liuzhuozheng Li,Yunhua Zhong,Guang Dai,Mengmeng Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localize queried objects, dynamic video segments, aims to localize, localize queried, queried objects

Comments:

Click to view abstract

Abstract:Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.

86. 【2604.12343】Detecting Precise Hand Touch Moments in Egocentric Video

Link: https://arxiv.org/abs/2604.12343

Authors: Huy Anh Nguyen,Feras Dayoub,Minh Hoai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: task of detecting, hands make contact, Hand-informed Context Enhanced, challenging task, contact

Comments: Accepted to CVPR Findings 2026

Click to view abstract

Abstract:We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft labels that emphasize hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
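
The strict criterion mentioned in this abstract (a prediction counts as correct only within a two-frame tolerance of the ground-truth moment) can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `match_within_tolerance`, the greedy one-to-one matching rule, and the frame indices are all assumptions made for the example.

```python
def match_within_tolerance(predicted, ground_truth, tolerance=2):
    """Count true/false positives and false negatives for frame-level events.

    Assumption: each ground-truth moment may be matched by at most one
    prediction, and predictions are matched greedily in temporal order.
    """
    unmatched = sorted(ground_truth)
    true_positives = 0
    for frame in sorted(predicted):
        # First still-unmatched ground-truth frame within the tolerance window.
        hit = next((g for g in unmatched if abs(g - frame) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives


# Hypothetical frame indices: 10 and 98 are within 2 frames of 12 and 100,
# while 50 is 3 frames away from 47 and therefore counts as a miss.
tp, fp, fn = match_within_tolerance([10, 50, 98], [12, 47, 100])
print(tp, fp, fn)
```

Counts of this kind are the usual input to precision and recall; how the paper aggregates them into average precision is not described in the abstract.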

87. 【2604.12342】CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Link: https://arxiv.org/abs/2604.12342

Authors: Qi Li,Cheng-Long Wang,Yinzhi Cao,Di Wang

Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: carefully chosen portion, privacy, carefully chosen, chosen portion, full dataset

Comments:

Click to view abstract

Abstract:Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocessing step for modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy free: the very choices of which data are included or excluded can introduce a new privacy surface and leak more sensitive information. Such information can be captured by adversaries either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the adversary's knowledge of the side-channel information, we define two practical attack scenarios: Subset-aware Side-channel Attacks and Black-box Attacks. Under both scenarios, we investigate two privacy surfaces unique to subset training: (1) Training-membership MIA (TM-MIA), which concerns only the privacy of training data membership, and (2) Selection-participation MIA (SP-MIA), which concerns the privacy of all samples that participated in the subset selection process. Notably, SP-MIA enlarges the notion of membership from model training to the entire data-model supply chain. Experiments on vision and language models show that existing threat models underestimate subset-training privacy risks: the expanded privacy surface leaks both training and selection membership, extending risks from individual models to the broader ML ecosystem.

88. 【2604.12341】Bridging the Micro--Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

Link: https://arxiv.org/abs/2604.12341

Authors: Xiaojie Liang,Zhimin Chen,Ziqi Sheng,Wei Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: conspicuous forensic artifacts, image editing advances, generative image editing, editing advances, locally realistic

Comments:

Click to view abstract

Abstract:As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro--macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.

89. 【2604.12335】All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Link: https://arxiv.org/abs/2604.12335

Authors: Tanzila Rahman,Renjie Liao,Leonid Sigal

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Training multimodal large, requires large-scale annotated, multimodal large language, Training multimodal, annotated data spanning

Comments: 8 Pages, 4 Tables, 4 Figures

Click to view abstract

Abstract:Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach in three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.

90. 【2604.12331】HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing

Link: https://arxiv.org/abs/2604.12331

Authors: Ivannia Gomez Moreno,Yi Yao,Ye Tian,Xiaofan Yu,Flavio Ponzina,Michael Sullivan,Jingyi Zhang,Mingyu Yang,Hun Seok Kim,Tajana Rosing

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic segmentation plays, scene understanding, autonomous driving, plays a pivotal, pivotal role

Comments:

Click to view abstract

Abstract:LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.

91. 【2604.12322】Self-Adversarial One Step Generation via Condition Shifting

Link: https://arxiv.org/abs/2604.12322

Authors: Deyuan Liu,Peng Sun,Yansen Han,Zhenglin Cheng,Chuyan Chen,Tao Lin

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tradeoff among fidelity, text to image, image synthesis, synthesis has moved, existing methods

Comments:

Click to view abstract

Abstract:The push for efficient text to image synthesis has moved the field toward one step sampling, yet existing methods still face a three way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter efficient tuning. In contrast, regression based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. A transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the model's current generation distribution, yielding a gradient that is provably GAN aligned, replacing the sample dependent discriminator terms that cause gradient vanishing. This discriminator free design is architecture preserving, making APEX a plug and play framework compatible with both full parameter and LoRA based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20× more parameters) in one step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33× inference speedup. Code is available at this https URL.

92. 【2604.12320】EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Link: https://arxiv.org/abs/2604.12320

Authors: Jianzhe Ma,Zhonghao Cao,Shangkui Chen,Yichen Xu,Wenxuan Wang,Qin Jin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: video large language, environments remain under-explored, large language models, excel in understanding, understanding slow-paced

Comments: Work in progress

Click to view abstract

Abstract:While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

93. 【2604.12319】RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Link: https://arxiv.org/abs/2604.12319

Authors: Guoan Xu,Yang Xiao,Guangwei Gao,Dongchen Zhu,Wenjing Jia,Guo-Jun Qi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiple sensing modalities, enhancing scene understanding, leveraging complementary information, Multimodal semantic segmentation, Multimodal semantic

Comments: 7 tables, 9 figures

Click to view abstract

Abstract:Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.

94. 【2604.12318】Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge

Link: https://arxiv.org/abs/2604.12318

Authors: Hayato Inoue,Shota Harada,Shumpei Takezaki,Ryoma Bise

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pipelines typically combine, Existing cell instance, segmentation pipelines typically, typically combine deterministic, Existing cell

Comments:

Click to view abstract

Abstract:Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.

95. 【2604.12315】GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

Link: https://arxiv.org/abs/2604.12315

Authors: Zhiwei Zhang,Xingyuan Zeng,Xinkai Kong,Kunquan Zhang,Haoyuan Liang,Bohan Shi,Juepeng Zheng,Jianxi Huang,Yutong Lu,Haohuan Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: sensing-based agricultural monitoring, remote sensing-based agricultural, supporting parcel surveying, Agricultural parcel extraction, agricultural monitoring

Comments: 15 pages, 11 figures. Submitted to ACM Multimedia 2026 Dataset Track

Click to view abstract

Abstract:Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.

96. 【2604.12309】Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Link: https://arxiv.org/abs/2604.12309

Authors: Rong Wang,Ruyi Zha,Ziang Cheng,Jiayu Yang,Pulak Purkait,Hongdong Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: consistent orbital videos, generating geometrically realistic, generating geometrically, consistent orbital, orbital videos

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.

97. 【2604.12307】Boosting Robust AIGI Detection with LoRA-based Pairwise Training

链接：https://arxiv.org/abs/2604.12307

作者：Ruiyang Xia,Qi Zhang,Yaowen Xu,Zhaofan Zou,Hao Sun,Zhongjiang He,Xuelong Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：practical detection methods, realistic AI-Generated Image, highly realistic AI-Generated, achieve AIGI detection, AI-Generated Image Detection

备注： 3rd place (3/514) technical report (CVPRW-26) at the NTIRE 2026: Robust AI-Generated Image Detection in the Wild Challenge

Abstract:The proliferation of highly realistic AI-Generated Images (AIGI) has necessitated the development of practical detection methods. While current AIGI detectors perform admirably on clean datasets, their detection performance frequently decreases when deployed "in the wild", where images are subjected to unpredictable, complex distortions. To resolve this critical vulnerability, we propose a novel LoRA-based Pairwise Training (LPT) strategy designed specifically to achieve robust detection for AIGI under severe distortions. The core of our strategy involves the targeted finetuning of a visual foundation model, the deliberate simulation of data distribution during the training phase, and a unique pairwise training process. Specifically, we introduce distortion and size simulations to better fit the distribution of the validation and test sets. Based on the strong visual representation capability of the visual foundation model, we finetune the model to achieve AIGI detection. The pairwise training is utilized to improve the detection by decoupling the generalization and robustness optimization. Experiments show that our approach secured 3rd place in the NTIRE Robust AI-Generated Image Detection in the Wild challenge.

98. 【2604.12292】CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

链接：https://arxiv.org/abs/2604.12292

作者：Gaoxiang Cong,Liang Li,Jiaxin Ye,Zhedong Zhang,Hongming Shan,Yuankai Qi,Qingming Huang

类目：Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：Movie dubbing aims, target video, aims to synthesize, synthesize speech, speech that preserves

备注：

Abstract:Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, triggering timbre and pronunciation degradation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory by executing acoustic style adapting, fine-grained visual calibrating, and time-aware context aligning. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism to simultaneously constrain frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves state-of-the-art performance across multiple metrics.

99. 【2604.12286】LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion

链接：https://arxiv.org/abs/2604.12286

作者：Clara Xue,Zizheng Yan,Zhenning Shi,Yuhang Yu,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Qingnan Fan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：key photo, short video clip, reselected key photo, photo capture ISP, Live Photo captures

备注： Accepted by ICLR 2026

Abstract:Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment. While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures. Our code is available at this https URL.

100. 【2604.12281】MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

链接：https://arxiv.org/abs/2604.12281

作者：Dongkyung Kang,Jaeyeon Hwang,Junseo Park,Minji Kang,Yeryeong Lee,Beomseok Ko,Hanyoung Roh,Jeongmin Shin,Hyeryung Jang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Style transfer aims, aims to render, visual characteristics, Attention Mass Allocation, Mass Allocation

备注： 16 pages, 16 figures, 6 tables

Abstract:Style transfer aims to render a content image with the visual characteristics of a reference style while preserving its underlying semantic layout and structural geometry. While recent diffusion-based models demonstrate strong stylization capabilities by leveraging powerful generative priors and controllable internal representations, they typically assume a single global style. Extending them to multi-style scenarios often leads to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations. To overcome these limitations, we propose MAST (Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer), a novel training-free framework that explicitly controls content-style interactions within the diffusion attention mechanism. To achieve artifact-free and structure-preserving stylization, MAST integrates four connected modules. First, Layout-preserving Query Anchoring prevents global layout collapse by firmly anchoring the semantic structure using content queries. Second, Logit-level Attention Mass Allocation deterministically distributes attention probability mass across spatial regions, seamlessly fusing multiple styles without boundary artifacts. Third, Sharpness-aware Temperature Scaling restores the attention sharpness degraded by multi-style expansion. Finally, Discrepancy-aware Detail Injection adaptively compensates for localized high-frequency detail losses by measuring structural discrepancies. Extensive experiments demonstrate that MAST effectively mitigates boundary artifacts and maintains structural consistency, preserving texture fidelity and spatial coherence even as the number of applied styles increases.

101. 【2604.12273】SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation

链接：https://arxiv.org/abs/2604.12273

作者：Yexiong Lin,Jia Shi,Shanshan Ye,Wanyu Wang,Yu Yao,Tongliang Liu

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：recent few-step methods, few-step methods achieving, methods achieving remarkable, remarkable inference acceleration, powerful generative framework

备注：

Abstract:Flow matching has emerged as a powerful generative framework, with recent few-step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class-conditional flows learn a frequency-weighted mean over intra-class sub-modes, causing the model to over-represent high-density modes while systematically neglecting low-density ones. To address this, we propose SubFlow, Sub-mode Conditioned Flow Matching, which eliminates averaging distortion by decomposing each class into fine-grained sub-modes via semantic clustering and conditioning the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, so the learned flow accurately targets individual modes with no averaging distortion, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug-and-play: it integrates seamlessly into existing one-step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet-256 demonstrate that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across different one-step generation frameworks. Project page: this https URL.
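
The averaging distortion described in this abstract can be illustrated with a toy one-dimensional example. The sketch below is not SubFlow itself: it assumes a hypothetical class with two sub-modes (80% of samples at -2.0, 20% at +3.0) and relies only on the standard fact that the constant minimizing mean squared error is the sample mean. All names are illustrative.

```python
from collections import defaultdict


def mse_optimal_constant(targets):
    """Return the constant that minimizes mean squared error: the sample mean."""
    return sum(targets) / len(targets)


# Hypothetical class with two sub-modes, stored as (value, sub_mode_index).
samples = [(-2.0, 0)] * 80 + [(3.0, 1)] * 20

# Conditioning on the class only: a single prediction for the whole class.
class_level = mse_optimal_constant([value for value, _ in samples])

# Conditioning on the sub-mode index: one prediction per sub-mode.
by_mode = defaultdict(list)
for value, mode in samples:
    by_mode[mode].append(value)
sub_mode_level = {mode: mse_optimal_constant(vals) for mode, vals in by_mode.items()}

print(class_level)     # frequency-weighted mean, which lies in neither sub-mode
print(sub_mode_level)  # one prediction per sub-mode, each on its own mode
```

In this toy setting the class-level prediction is pulled toward the dominant sub-mode and matches neither, while conditioning on the sub-mode index recovers both, which is the intuition the abstract gives for decomposing classes into sub-modes.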

102. 【2604.12270】DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

链接：https://arxiv.org/abs/2604.12270

作者：Yuan Huang,Sijie Zhao,Jing Cheng,Hao Xu,Shaohui Jiao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：maintaining temporal consistency, challenging open problem, visually coherent content, temporal consistency, remains a challenging

备注：

Abstract:Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.

103. 【2604.12257】Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement

链接：https://arxiv.org/abs/2604.12257

作者：Hang Xu,Chen Long,Bing Wang,Hao Chen,Zhen Dong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：robust visual perception, marine applications, essential for robust, robust visual, visual perception

备注：

Abstract:Underwater Image Enhancement (UIE) is essential for robust visual perception in marine applications. However, existing methods predominantly rely on uniform mapping tailored to average dataset distributions, leading to over-processing of mildly degraded images or insufficient recovery of severely degraded ones. To address this challenge, we propose a novel adaptive enhancement framework, SDAR-Net. Unlike existing uniform paradigms, it first decouples specific degradation styles from the input and subsequently modulates the enhancement process adaptively. Specifically, since underwater degradation primarily shifts the appearance while keeping the scene structure, SDAR-Net formulates image features into dynamic degradation style embeddings and static scene structural representations through a carefully designed training framework. Subsequently, we introduce an adaptive routing mechanism. By evaluating style features and adaptively predicting soft weights at different enhancement states, it guides the weighted fusion of the corresponding image representations, accurately satisfying the adaptive restoration demands of each image. Extensive experiments show that SDAR-Net achieves new state-of-the-art (SOTA) performance with a PSNR of 25.72 dB on a real-world benchmark, and demonstrates its utility in downstream vision tasks. Our code is available at this https URL.

104. 【2604.12255】ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

链接：https://arxiv.org/abs/2604.12255

作者：Huanzhen Wang,Ziheng Zhou,Jiaqi Song,Li He,Yunshi Lan,Yan Wang,Wenqiang Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：wild remains challenging, remains challenging due, long-tail distributions, wild remains, remains challenging

备注：

Abstract:Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.

105. 【2604.12251】ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models

链接：https://arxiv.org/abs/2604.12251

作者：Xinliang Wang,Yifeng Shi,Zhenyu Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Gaussian Splatting, delivers high-fidelity real-time, high-fidelity real-time rendering, delivers high-fidelity, high-fidelity real-time

备注： The second author is the corresponding author

Abstract:3D Gaussian Splatting (3DGS) delivers high-fidelity real-time rendering but suffers from geometric and photometric degradations under sparse-view constraints. Current generative restoration approaches are often limited by insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data, resulting in multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions. In this paper, we present ArtifactWorld, a framework that resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm. To address the data bottleneck, we establish a fine-grained phenomenological taxonomy of 3DGS artifacts and construct a comprehensive training set of 107.5K diverse paired video clips to enhance model robustness. Architecturally, we unify the restoration process within a video diffusion backbone, utilizing an isomorphic predictor to localize structural defects via an artifact heatmap. This heatmap then guides the restoration through an Artifact-Aware Triplet Fusion mechanism, enabling precise, intensity-guided spatio-temporal repair within native self-attention. Extensive experiments demonstrate that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction. Code and dataset will be made public.

106. 【2604.12245】Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown

链接：https://arxiv.org/abs/2604.12245

作者：Sandra Gómez-Gálvez,Tobias Olenyi,Gillian Dobbie,Katerina Taškova

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词：Deep neural networks, Deep neural, exhibit poor confidence, poor confidence calibration, confidence calibration

备注： Published at TMLR 2026. [this https URL](https://openreview.net/forum?id=DONqw1KhHq) Video: [this https URL](https://youtu.be/7WuSkC-aWW8?si=9fgq5ZN7euIyGZGU) Code: [this https URL](https://github.com/sandruskyi/SocratesLoss)

Abstract:Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high-stakes applications. Current ad-hoc confidence calibration methods attempt to fix this during training but face a fundamental trade-off: two-phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single-loss methods are stable but underperform in classification. This paper addresses and mitigates this stability-performance trade-off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving a more favorable accuracy-calibration trade-off, often converging faster than existing methods.

107. 【2604.12239】Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

链接：https://arxiv.org/abs/2604.12239

作者：Manognya Lokesh Reddy,Zheng Liu

类目：Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词：Driver Assistance Systems, Advanced Driver Assistance, Assistance Systems, Advanced Driver, Driver Assistance

备注： 17 pages, 9 figures

Abstract:Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing OCR text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work. Extensive outdoor experiments confirm a mean absolute error of 2.3% at 10 m and continuous distance output during brief plate occlusions, outperforming deep learning baselines by a factor of five in relative error.
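
The geometric core of this abstract (metric ranging from an object of known physical size, inverse-variance fusion of several depth estimates, and time-to-collision) can be sketched with textbook formulas. This is a minimal illustration under stated assumptions, not the authors' pipeline: the focal length, character height, and variances are hypothetical placeholder values, and the Kalman filtering stage is omitted.

```python
def pinhole_distance(focal_px, real_height_m, pixel_height):
    """Pinhole camera model: distance = focal length (px) * real size (m) / image size (px)."""
    return focal_px * real_height_m / pixel_height


def inverse_variance_fusion(estimates):
    """Fuse (value, variance) pairs, weighting each value by 1 / variance."""
    weights = [1.0 / var for _, var in estimates]
    fused = sum(w * v for w, (v, _) in zip(weights, estimates)) / sum(weights)
    fused_variance = 1.0 / sum(weights)
    return fused, fused_variance


def time_to_collision(distance_m, closing_speed_mps):
    """Seconds until contact at constant closing speed; infinite if the gap is not closing."""
    if closing_speed_mps <= 0:
        return float("inf")
    return distance_m / closing_speed_mps


# Hypothetical numbers: 1400 px focal length, 0.07 m character height, 9.8 px tall in the image.
plate_distance = pinhole_distance(1400.0, 0.07, 9.8)

# Fuse a low-variance estimate with a noisier one (values and variances are illustrative).
fused, fused_var = inverse_variance_fusion([(10.0, 0.25), (11.0, 1.0)])

ttc = time_to_collision(fused, closing_speed_mps=5.0)
print(plate_distance, fused, fused_var, ttc)
```

The fused estimate sits closer to the low-variance input, and its variance is smaller than either input variance, which is the usual motivation for inverse-variance weighting.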

108. 【2604.12221】BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

链接：https://arxiv.org/abs/2604.12221

作者：Qingyuan Cai,Saihui Hou,Xuecai Hu,Yongzhen Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：reliable biometric technology, diverse clothing styles, faces significant challenges, significant challenges caused, biometric technology

备注： CVPR 2026, Project Page: [this https URL](https://github.com/BarbieGait/BarbieGait)

Abstract:Gait recognition, as a reliable biometric technology, has developed rapidly in recent years but still faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and poses one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research.

109. 【2604.12219】Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

链接：https://arxiv.org/abs/2604.12219

作者：Wentai Zhang,Ronghui Xi,Shiyao Peng,Jiayu Huang,Haoran Luo,Zichen Tang,Haihong E

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Transformers have revolutionized, Video Diffusion Transformers, revolutionized high-fidelity video, massive computational burden, Diffusion Transformers

备注：

Abstract:Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.

110. 【2604.12175】Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

链接：https://arxiv.org/abs/2604.12175

作者：Xinjie Zhang,Qiang Li,Xiaowen Ma,Axi Niu,Li Yan,Qingsen Yan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Image Editing Quality, Recent advances, Editing Quality Assessment, reliable Image Editing, image editing

备注：

Abstract:Recent advances in image editing have heightened the need for reliable Image Editing Quality Assessment (IEQA). Unlike traditional methods, IEQA requires complex reasoning over multimodal inputs and multi-dimensional assessments. Existing MLLM-based approaches often rely on human heuristic prompting, leading to two key limitations: rigid metric prompting and distance-agnostic score modeling. These issues hinder alignment with implicit human criteria and fail to capture the continuous structure of score spaces. To address this, we propose Define-and-Score Image Editing Quality Assessment (DS-IEQA), a unified framework that jointly learns evaluation criteria and score representations. Specifically, we introduce Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback. Furthermore, we propose Token-Decoupled Distance Regression Loss (TDRL), which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization. Extensive experiments show our method's superior performance; it ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.
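
The contrast between distance-agnostic and distance-aware score modeling can be shown with a small sketch. The abstract does not give the exact form of TDRL, so the code below is only an assumed reading of "expected distance minimization": it takes the softmax distribution over a hypothetical set of discrete score tokens (1 to 5) and computes the expected absolute distance to the target score, compared against plain cross-entropy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_distance_loss(logits, score_values, target):
    """Expected |score - target| under the predicted distribution over score tokens."""
    probs = softmax(logits)
    return sum(p * abs(s - target) for p, s in zip(probs, score_values))


def cross_entropy(logits, score_values, target):
    """Standard cross-entropy: depends only on the probability of the exact target token."""
    probs = softmax(logits)
    return -math.log(probs[score_values.index(target)])


scores = [1, 2, 3, 4, 5]
target = 4
near_miss = [0.0, 0.0, 0.0, 0.0, 5.0]  # most mass on score 5, one step from the target
far_miss = [5.0, 0.0, 0.0, 0.0, 0.0]   # most mass on score 1, three steps from the target

ce_near = cross_entropy(near_miss, scores, target)
ce_far = cross_entropy(far_miss, scores, target)
dist_near = expected_distance_loss(near_miss, scores, target)
dist_far = expected_distance_loss(far_miss, scores, target)
print(ce_near, ce_far, dist_near, dist_far)
```

Cross-entropy assigns both predictions the same loss because the target token receives the same probability in each, whereas the expected-distance loss penalizes the far miss more heavily, reflecting the ordinal structure of the score space.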

111. 【2604.12163】Nucleus-Image: Sparse MoE for Image Generation

链接：https://arxiv.org/abs/2604.12163

作者：Chandan Akiti,Ajay Modukuri,Murali Nandan Nagarapu,Gunavardhan Akiti,Haozhe Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Pareto frontier, exceeding leading models, activating only approximately, forward pass, matching or exceeding

备注：

Abstract:We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
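
Expert-choice routing, mentioned in this abstract, inverts the usual token-choice scheme: each expert selects a fixed number of tokens with the highest affinity, so every expert processes the same load. The sketch below illustrates the general idea only and is not the Nucleus-Image implementation; the affinity scores, the capacity of 2, and the function name are hypothetical.

```python
def expert_choice_routing(scores, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.

    scores[t][e] is the affinity of token t for expert e.
    Returns one sorted list of selected token indices per expert.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignment = []
    for expert in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][expert], reverse=True)
        assignment.append(sorted(ranked[:capacity]))
    return assignment


# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.7],
    [0.2, 0.6],
    [0.3, 0.4],
]
# Capacity = tokens * capacity_factor / experts = 4 * 1.0 / 2 = 2 tokens per expert.
assignment = expert_choice_routing(scores, capacity=2)
print(assignment)
```

With these scores, one token is selected by both experts and another by neither, while each expert still handles exactly two tokens. Load balance is guaranteed by construction, at the cost of a variable number of experts per token.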

112. 【2604.12159】VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

链接：https://arxiv.org/abs/2604.12159

作者：Parth Parag Kulkarni,Rohit Gupta,Prakash Chandra Chhipa,Mubarak Shah

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：precise GPS coordinates, social media, map its trajectory, applications in forensics, aims to determine

备注： Accepted at CVPR 2026

Abstract:The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: this https URL
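
Results such as "improvement at the 1 km threshold" are typically computed as the fraction of predictions whose distance to the ground-truth coordinates falls within the threshold. The abstract does not state the exact evaluation protocol, so the following is a sketch that assumes great-circle (haversine) distance with a mean Earth radius of 6371 km; the function names and sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_threshold(predictions, ground_truth, threshold_km=1.0):
    """Fraction of predicted (lat, lon) points within threshold_km of the ground truth."""
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(predictions)


# Illustrative frames: the first prediction is about half a kilometre off, the second about a degree off.
predicted = [(0.0, 0.0), (10.0, 10.0)]
actual = [(0.0, 0.005), (10.0, 11.0)]
acc_1km = accuracy_at_threshold(predicted, actual, threshold_km=1.0)
print(acc_1km)
```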

113. 【2604.12152】Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

链接：https://arxiv.org/abs/2604.12152

作者：Sebastian Cajas,Ashaba Judith,Rahul Gorijavolu,Sahil Kapadia,Hillary Clinton Kasimbazi,Leo Kinyera,Emmanuel Paul Kwesiga,Sri Sri Jaithra Varma Manthena,Luis Filipe Nakayama,Ninsiima Doreen,Leo Anthony Celi

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：super-resolution universally inherit, universally inherit variational, image super-resolution universally, Latent diffusion models, inherit variational autoencoders

备注：

Abstract:Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within ±0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at this https URL.
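
To put the reported PSNR gains in perspective, recall the standard definition PSNR = 10 · log10(MAX² / MSE): a gain of about 3 dB corresponds to roughly halving the mean squared error. The sketch below uses this textbook formula on hypothetical pixel values normalized to [0, 1]; it is not taken from the paper's code.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical reference pixels and two reconstructions with uniform errors.
reference = [0.0, 0.5, 1.0, 0.5]
coarse = [value + 0.1 for value in reference]               # MSE of about 0.01
fine = [value + 0.1 / math.sqrt(2) for value in reference]  # MSE of about 0.005, half as large

psnr_coarse = psnr(reference, coarse)
psnr_fine = psnr(reference, fine)
print(psnr_coarse, psnr_fine, psnr_fine - psnr_coarse)
```

Halving the MSE raises PSNR by 10 · log10(2), about 3.01 dB, which is the same order as the improvement the abstract attributes to swapping the autoencoder.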

114. 【2604.12148】ViLL-E: Video LLM Embeddings for Retrieval

Link: https://arxiv.org/abs/2604.12148

Authors: Rohit Gupta,Jayakrishnan Unnikrishnan,Fan Fei,Sheng Liu,Son Tran,Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Large Language, Video Question Answering, Large Language Models, Large Language, Question Answering

Comments: Accepted at ACL 2026 Main conference

Abstract: Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).

115. 【2604.12119】Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.12119

Authors: Md Tanvirul Alam

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large vision-language models, separate perception failures, cleanly separate perception, Large vision-language, familiar semantic priors

Abstract:Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at this https URL.

116. 【2604.12115】HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.12115

Authors: Xinyun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, Large vision-language, unstable visual grounding, achieve strong multimodal, strong multimodal performance

Comments: 10 pages, 4 figures, 6 tables

Abstract:Large vision-language models (LVLMs) achieve strong multimodal performance, but still suffer from hallucinations caused by unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically apply calibration at every decoding step, introducing unnecessary computation and potentially disrupting stable predictions. We address this problem by identifying layer-wise hesitation, a simple signal of grounding instability reflected by fluctuations in token preference across intermediate layers. Based on this observation, we propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that preserves standard full-branch inference and activates calibration only at hesitation-prone steps. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates while avoiding unnecessary intervention on stable steps. Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.

117. 【2604.12113】PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

Link: https://arxiv.org/abs/2604.12113

Authors: Minjae Lee,Sungwoo Hur,Soojin Hwang,Won Hwa Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Foundation Models, Foundation Models, significantly advanced broad, Segment Anything Model, Visual Foundation

Abstract:Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM's mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.

118. 【2604.12102】Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

Link: https://arxiv.org/abs/2604.12102

Authors: Arun Sharma

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: spatial-aware research agents, introduce compute-grounded reasoning, Atlas instantiates CGR, asked to generate, introduce compute-grounded

Comments: 11 pages. Submitted to NeurIPS 2026. Code: [this https URL](https://github.com/arunshar/spatial-atlas)

Abstract:We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.

119. 【2604.12100】PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning

Link: https://arxiv.org/abs/2604.12100

Authors: Syed Fahim Ahmed,Gnanesh Rasineni,Florian Koehler,Abu Zahid Bin Aziz,Mei Wang,Attila Gyulassy,Brian Summa,J. Quincy Brown,Valerio Pascucci,Shireen Y. Elhabian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Instance Learning, slide-level Multiple Instance, Multiple Instance, Whole-slide image, Instance Learning

Comments: 11 pages, 2 figures, 2 tables. Under review at MICCAI 2026

Abstract:Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether "somewhere in the slide there is cancer." As a result, the model's inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.

120. 【2604.12084】INST-Align: Implicit Neural Alignment for Spatial Transcriptomics via Canonical Expression Fields

Link: https://arxiv.org/abs/2604.12084

Authors: Bonian Han,Cong Qi,Przemyslaw Musialski,Zhi Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: measures mRNA expression, multi-slice analysis faces, preserving spatial organization, large non-rigid deformations, inter-slice batch effects

Comments: 10 pages, 2 figures, 3 tables. Submitted to MICCAI 2026

Abstract: Spatial transcriptomics (ST) measures mRNA expression while preserving spatial organization, but multi-slice analysis faces two coupled difficulties: large non-rigid deformations across slices and inter-slice batch effects when alignment and integration are treated independently. We present INST-Align, an unsupervised pairwise framework that couples a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural representation mapping spatial coordinates to expression embeddings, for joint alignment and reconstruction. A two-phase training strategy first establishes a stable canonical embedding space and then jointly optimizes deformation and spatial-feature matching, enabling mutually constrained alignment and representation learning. Cross-slice parameter sharing of the canonical field regularizes ambiguous correspondences and absorbs batch variation. Across nine datasets, INST-Align achieves state-of-the-art mean OT Accuracy (0.702), NN Accuracy (0.719), and Chamfer distance, with Chamfer reductions of up to 94.9% on large-deformation sections relative to the strongest baseline. The framework also yields biologically meaningful spatial embeddings and coherent 3D tissue reconstruction. The code will be released after the review phase.

121. 【2604.12075】OpenTME: An Open Dataset of AI-powered HE Tumor Microenvironment Profiles from TCGA

Link: https://arxiv.org/abs/2604.12075

Authors: Maaike Galama,Nina Kozar-Gillan,Christina Embacher,Todd Dembo,Cornelius Böhm,Evelyn Ramberger,Julika Ribbat-Idel,Rosemarie Krupar,Verena Aumiller,Miriam Hägele,Kai Standvoss,Gerrit Erdmann,Blanca Pablos,Ari Angelo,Simon Schallenberg,Andrew Norgan,Viktor Matyas,Klaus-Robert Müller,Maximilian Alber,Lukas Ruff,Frederick Klauschen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Keywords: stained histopathology remains, histopathology remains scarce, Cancer Genome Atlas, quantitative TME characterization, treatment response

Abstract:The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (HE)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 HE-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas HE-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

122. 【2604.12068】Privacy-Preserving Structureless Visual Localization via Image Obfuscation

Link: https://arxiv.org/abs/2604.12068

Authors: Vojtech Panek,Patrik Beliansky,Zuzana Kukelova,Torsten Sattler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual localization, visual localization systems, task of estimating, Visual, localization

Abstract:Visual localization is the task of estimating the camera pose of an image relative to a scene representation. In practice, visual localization systems are often cloud-based. Naturally, this raises privacy concerns in terms of revealing private details through the images sent to the server or through the representations stored on the server. Privacy-preserving localization aims to avoid such leakage of private details. However, the resulting localization approaches are significantly more complex, slower, and less accurate than their non-privacy-preserving counterparts. In this paper, we consider structureless localization methods in the context of privacy preservation. Structureless methods represent the scene through a set of reference images with known camera poses and intrinsics. In contrast to existing methods proposing representations that are as privacy-preserving as possible, we study a simple image obfuscation approach based on common image operations, e.g., replacing RGB images with (semantic) segmentations. We show that existing structureless pipelines do not need any special adjustments, as modern feature matchers can match obfuscated images out of the box. The results are easy-to-implement pipelines that can ensure both the privacy of the query images and the scene representations. Detailed experiments on multiple datasets show that the resulting methods achieve state-of-the-art pose accuracy for privacy-preserving approaches.

123. 【2604.12035】Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs

Link: https://arxiv.org/abs/2604.12035

Authors: Kaizhen Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual token pruning, large language models, Visual token, multimodal large language, Expected Calibration Error

Abstract:Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.
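
As a reference for the calibration metrics named in this abstract, the following is a minimal sketch of Expected Calibration Error (ECE) and the Brier score for a model that reports one confidence per answer. It is not the paper's evaluation code: the function names, the choice of 10 equal-width bins, and the use of a per-answer correctness flag are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (assumed binning scheme).

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness outcome."""
    if len(confidences) == 0 or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(
        (c - (1.0 if ok else 0.0)) ** 2 for c, ok in zip(confidences, correct)
    ) / len(confidences)
```

For example, a model that answers four questions with confidence 0.9 each but gets only three right is overconfident by 0.15, which is exactly what the ECE function reports for that input.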

124. 【2604.12033】Benchmarking Deflection and Hallucination in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.12033

Authors: Nicholas Moratelli,Christopher Davis,Leonardo F. R. Ribeiro,Bill Byrne,Gonzalo Iglesias

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, increasingly rely, Large, answer knowledge-intensive multimodal

Comments: Accepted to ACL 2026

125. 【2604.12028】Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection

Link: https://arxiv.org/abs/2604.12028

Authors: Salar Adel Sabri,Ramadhan J. Mstafa

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: synthetic facial content, sophisticated generative models, facial content, raising serious concerns, digital trust

Comments: 10 Pages, 6 Figures, 2 Tables

Abstract:The proliferation of sophisticated generative models has significantly advanced the realism of synthetic facial content, known as deepfakes, raising serious concerns about digital trust. Although modern deep learning-based detectors perform well, many rely on spatial-domain features that degrade under compression. This limitation has prompted a shift toward integrating frequency-domain representations with deep learning to improve robustness. Prior research has explored frequency transforms such as Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), and Wavelet Transform, among others. However, to the best of our knowledge, the Curvelet Transform, despite its superior directional and multiscale properties, remains entirely unexplored in the context of deepfake detection. In this work, we introduce a novel Curvelet-based detection approach that enhances feature quality through wedge-level attention and scale-aware spatial masking, both trained to selectively emphasize discriminative frequency components. The refined frequency cues are reconstructed and passed to a modified pretrained Xception network for classification. Evaluated on two compression qualities in the challenging FaceForensics++ dataset, our method achieves 98.48% accuracy and 99.96% AUC on FF++ low compression, while maintaining strong performance under high compression, demonstrating the efficacy and interpretability of Curvelet-informed forgery detection.

126. 【2604.12012】TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Link: https://arxiv.org/abs/2604.12012

Authors: Bingyi Cao,Koert Chen,Kevis-Kokitsi Maninis,Kaifeng Chen,Arjun Karpur,Ye Xia,Sahil Dua,Tanmaya Dabral,Guangxing Han,Bohyung Han,Joshua Ainslie,Alex Bewley,Mithun Jacob,René Wagner,Washington Ramos,Krzysztof Choromanski,Mojtaba Seyedhosseini,Howard Zhou,André Araujo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabled significant improvements, segmentation and depth, depth prediction, enabled significant, significant improvements

Comments: CVPR2026 camera-ready + appendix

Abstract: Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at this https URL.

127. 【2604.11998】The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

Link: https://arxiv.org/abs/2604.11998

Authors: Xingyu Qiu,Yuqian Fu,Jiawei Geng,Bin Ren,Jiancheng Pan,Zongwei Wu,Hao Tang,Yanwei Fu,Radu Timofte,Nicu Sebe,Mohamed Elhoseiny,Lingyi Hong,Mingxi Cheng,Xingqi He,Runze Li,Xingdong Sheng,Wenqiang Zhang,Jiacong Liu,Shu Luo,Yikai Qin,Yaze Zhao,Yongwei Jiang,Yixiong Zou,Zhe Zhang,Yang Yang,Kaiyu Li,Bowen Fu,Zixuan Jiang,Ke Li,Hui Qiao,Xiangyong Cao,Xuanlong Yu,Youyang Sha,Longfei Liu,Di Yang,Xi Shen,Kyeongryeol Go,Taewoong Jang,Saiprasad Meesiyawar,Ravi Kirasur,Rakshita Kulkarni,Bhoomi Deshpande,Harsh Patil,Uma Mudenagudi,Shuming Hu,Chao Chen,Tao Wang,Wei Zhou,Qi Xu,Zhenzhao Xing,Dandan Zhao,Hanzhe Xia,Dongdong Lu,Zhe Zhang,Jingru Wang,Guangwei Huang,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Liwei Zhou,Bei Dou,Tao Wu,Zekang Fan,Junjie Liu,Adhémar de Senneville,Flavien Armangeon,Mengbers,Yazhe Lyu,Zhimeng Xin,Zijian Zhuang,Hongchun Zhu,Li Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Cross-domain few-shot object, few-shot object detection, Cross-domain few-shot, existing object detectors, few-shot learning approaches

Comments: accepted by CVPRW 26 @ NTIRE

Abstract:Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: this https URL.

128. 【2604.11993】Ultra-low-light computer vision using trained photon correlations

Link: https://arxiv.org/abs/2604.11993

Authors: Mandar M. Sohoni,Jérémie Laydevant,Mathieu Ouellet,Shi-Yuan Ma,Ryotatsu Yanagimoto,Benjamin A. Ash,Tatsuhiro Onodera,Tianyu Wang,Logan G. Wright,Peter L. McMahon

Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Keywords: detector clicks due, allowing high-fidelity images, noisy camera frames, spatially correlated, noise are uncorrelated

Comments: 49 pages, 47 figures

Abstract:Illumination using correlated photon sources has been established as an approach to allowing high-fidelity images to be reconstructed from noisy camera frames by taking advantage of the knowledge that signal photons are spatially correlated whereas detector clicks due to noise are uncorrelated. However, in computer-vision tasks, the goal is often not ultimately to reconstruct an image, but to make inferences about a scene -- such as what object is present. Here we show how correlated-photon illumination can be used to gain an advantage in a hybrid optical-electronic computer-vision pipeline for object recognition. We demonstrate correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend in a way that the Transformer can learn to benefit from the correlations, using a small number (= 100) of shots. We show a classification accuracy enhancement of up to 15 percentage points over conventional, uncorrelated-illumination-based computer vision in ultra-low-light and noisy imaging conditions, as well as an improvement over using untrained correlated-photon illumination. Our work illustrates how specializing to a computer-vision task -- object recognition -- and training the pattern of photon correlations in conjunction with a digital backend allows us to push the limits of accuracy in highly photon-budget-constrained scenarios beyond existing methods focused on image reconstruction.

129. 【2604.11992】ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting

Link: https://arxiv.org/abs/2604.11992

Authors: Daniel Yang,Jungseok Hong,John J. Leonard,Yogesh Girdhar

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, field robot applications, computationally intensive processes, powerful visual representation, poses typically obtained

Abstract:3D Gaussian Splatting is a powerful visual representation, providing high-quality and efficient 3D scene reconstruction, but it is crucially dependent on accurate camera poses typically obtained from computationally intensive processes like structure-from-motion that are unsuitable for field robot applications. However, in these domains, multimodal sensor data from acoustic, inertial, pressure, and visual sensors are available and suitable for pose-graph optimization-based SLAM methods that can estimate the vehicle's trajectory and thus our needed camera poses while providing uncertainty. We propose a 3DGS-based incremental reconstruction framework, ReefMapGS, that builds an initial model from a high certainty region and progressively expands to incorporate the whole scene. We reconstruct the scene incrementally by interleaving local tracking of new image observations with optimization of the underlying 3DGS scene. These refined poses are integrated back into the pose-graph to globally optimize the whole trajectory. We show COLMAP-free 3D reconstruction of two underwater reef sites with complex geometry as well as more accurate global pose estimation of our AUV over survey trajectories spanning up to 700 m.

130. 【2604.11970】INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Link: https://arxiv.org/abs/2604.11970

Authors: Somraj Gautam,Anathapindika Dravichi,Gaurav Harit

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Visual Question Answering, Bahasa Indonesia, Question Answering, real-world document images, Table Visual Question

Comments: Accepted in ACL 2026 (Findings)

131. 【2604.11961】Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery

Link: https://arxiv.org/abs/2604.11961

Authors: Chitra Banarjee,Patrick Kwon,Ania Lipat,Rui Xie,Chen Chen,Ladda Thiamwong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: key clinical indicator, Human Mesh Recovery, key clinical, clinical indicator, older adults

Comments: Work was accepted at Computer Vision for Biomechanics Workshop (CVBW) at CVPR 2026

Abstract:Gait assessment is a key clinical indicator of fall risk and overall health in older adults. However, standard clinical practice is largely limited to stopwatch-measured gait speed. We present a pipeline that leverages a 3D Human Mesh Recovery (HMR) model to extract gait parameters from recordings of older adults completing the Timed Up and Go (TUG) test. From videos recorded across different community centers, we extract and analyze spatiotemporal gait parameters, including step time, sit-to-stand duration, and step length. We found that video-derived step time was significantly correlated with IMU-based insole measurements. Using linear mixed effects models, we confirmed that shorter, more variable step lengths and longer sit-to-stand durations were predicted by higher self-rated fall risk and fear of falling. These findings demonstrate that our pipeline can enable accessible and ecologically valid gait analysis in community settings.

132. 【2604.11932】EigenCoin: sassanid coins classification based on Bhattacharyya distance

Link: https://arxiv.org/abs/2604.11932

Authors: Rahele Allahverdi,Mohammad Mahdi Dehshibi,Azam Bastanfard,Daryoosh Akbarzadeh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Solving pattern recognition, pattern recognition problems, Solving pattern, hot topic, pattern recognition

Comments: 2nd World Conference on Information Technology (WCIT-2011)

Abstract:Solving pattern recognition problems using imbalanced databases is a hot topic, which entices researchers to bring it into focus. Therefore, we consider this problem in the application of Sassanid coins classification. Our focus is not only on proposing EigenCoin manifold with Bhattacharyya distance for the classification task, but also on testing the influence of the holistic and feature-based approaches. EigenCoin consists of three main steps namely manifold construction, mapping test data, and classification. Conducted experiments show EigenCoin outperformed other observed algorithms and achieved the accuracy from 9.45% up to 21.75%, while it has the capability of handling the over-fitting problem.
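
The abstract does not say which form of the Bhattacharyya distance EigenCoin uses, so the following is only a minimal sketch of the textbook definition for two discrete probability distributions, D_B = -ln(sum_i sqrt(p_i * q_i)). The function name and the list-based inputs are assumptions for illustration.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions.

    p, q: sequences of non-negative probabilities over the same outcomes,
    each assumed to sum to 1 (hypothetical helper, not the paper's code).
    """
    if len(p) != len(q) or len(p) == 0:
        raise ValueError("distributions must be non-empty and of equal length")
    # Bhattacharyya coefficient: overlap between the two distributions.
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if coefficient == 0:
        return math.inf  # no overlapping support
    return -math.log(coefficient)
```

Identical distributions give a distance of zero, and distributions with no shared support give an infinite distance, so a nearest-class rule would assign a test sample to the class with the smallest value.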

133. 【2604.11927】A Workflow to Efficiently Generate Dense Tissue Ground Truth Masks for Digital Breast Tomosynthesis

Link: https://arxiv.org/abs/2604.11927

Authors: Tamerlan Mustafaev,Oleg Kruglov,Margarita Zuley,Luana de Mero Omena,Guilherme Muniz de Oliveira,Vitor de Sousa Franca,Bruno Barufaldi,Robert Nishikawa,Juhun Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Digital breast tomosynthesis, breast cancer screening, Digital breast, breast tomosynthesis, DBT

Comments:

Abstract:Digital breast tomosynthesis (DBT) is now the standard of care for breast cancer screening in the USA. Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized risk estimation, but algorithm development is limited by scarce human-delineated training data. In this study we introduce a time- and labor-saving framework to generate a human-annotated binary segmentation mask for dense tissue in DBT. Our framework enables a user to outline a rough region of interest (ROI) enclosing dense tissue on the central reconstructed slice of a DBT volume and select a segmentation threshold to generate the dense tissue mask. The algorithm then projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to maintain consistent dense tissue delineation across the DBT volume. By requiring annotation only on the central slice, the framework substantially reduces annotation time and labor. We used 44 DBT volumes from the DBTex dataset for evaluation. Inter-reader agreement was assessed by computing patient-wise Dice similarity coefficients between segmentation masks produced by two radiologists, yielding a median of 0.84. Accuracy of the proposed method was evaluated by having a radiologist manually segment the 20th and 80th percentile slices from each volume (CC and MLO views; 176 slices total) and calculate Dice scores between the manual and proposed segmentations, yielding a median of 0.83.
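
Both evaluations above report a median Dice similarity coefficient. As a rough sketch of that metric, assuming masks are given as nested lists of 0/1 values (the data layout and the function names are illustrative, not taken from the paper):

```python
import statistics


def dice_coefficient(mask_a, mask_b):
    """Dice similarity 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    if total == 0:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


def median_dice(mask_pairs):
    """Median Dice score over an iterable of (mask_a, mask_b) pairs."""
    return statistics.median(dice_coefficient(a, b) for a, b in mask_pairs)
```

A median is less sensitive than a mean to a few volumes with very poor overlap, which is worth keeping in mind when comparing the 0.84 inter-reader and 0.83 method-versus-reader figures.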

134. 【2604.11913】V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

Link: https://arxiv.org/abs/2604.11913

Authors: Chengkun Yue,Chuanzhi Xu,Jiangpeng He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing approaches largely, approaches largely rely, finally completed dish, Nutrition estimation, computational health

Comments: Accepted to the 3rd MetaFood Workshop at CVPR 2026

Abstract:Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finally completed dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether the cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframes selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at this https URL.

135. 【2604.11868】MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs

Link: https://arxiv.org/abs/2604.11868

Authors: Md Rakibul Haque,KM Arefeen Sultan,Tushar Kataria,Shireen Elhabian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve strong performance, limit clinical trust, diagnosis prediction, explain predictions, achieve strong

Comments:

Abstract:While medical Vision-Language models (VLMs) achieve strong performance on tasks such as tumor or organ segmentation and diagnosis prediction, their opaque latent representations limit clinical trust and the ability to explain predictions. Interpretability of these multimodal representations is therefore essential for the trustworthy clinical deployment of pretrained medical VLMs. However, current interpretability methods, such as gradient- or attention-based visualizations, are often limited to specific tasks such as classification. Moreover, they do not provide concept-level explanations derived from shared pretrained representations that can be reused across downstream tasks. We introduce MedConcept, a framework that uncovers latent medical concepts in a fully unsupervised manner and grounds them in clinically verifiable textual semantics. MedConcept identifies sparse neuron-level concept activations from pretrained VLM representations and translates them into pseudo-report-style summaries, enabling physician-level inspection of internal model reasoning. To address the lack of quantitative evaluation in concept-based interpretability, we introduce a quantitative semantic verification protocol that leverages an independent pretrained medical LLM as a frozen external evaluator to assess concept alignment with radiology reports. We define three concept scores, Aligned, Unaligned, and Uncertain, to quantify semantic support, contradiction, or ambiguity relative to radiology reports and use them exclusively for post hoc evaluation. These scores provide a quantitative baseline for assessing interpretability in medical VLMs. All code, prompts, and data will be released upon acceptance.

136. 【2604.11843】UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators

Link: https://arxiv.org/abs/2604.11843

Authors: Yigit Yilmaz,Elena Petrova,Mehmet Kaya,Lucia Rossi,Amir Rahman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tracing AI-generated content, recently gained attention, protecting image ownership, Invisible watermarking, AI-generated content

Comments: work in progress

Abstract:Invisible watermarking for autoregressive (AR) image generation has recently gained attention as a means of protecting image ownership and tracing AI-generated content. However, existing approaches suffer from three key limitations: (1) they embed only zero-bit watermarks for binary verification, lacking the ability to convey multi-bit messages; (2) they rely on static codebook partitioning strategies that are vulnerable to security attacks once the partition is exposed; and (3) they are designed for specific AR architectures, failing to generalize across diverse AR paradigms. We propose UniMark, a training-free, unified watermarking framework for autoregressive image generators that addresses all three limitations. UniMark introduces three core components: Adaptive Semantic Grouping (ASG), which dynamically partitions codebook entries based on semantic similarity and a secret key, ensuring both image quality preservation and security; Block-wise Multi-bit Encoding (BME), which divides the token sequence into blocks and encodes different bits across blocks with error-correcting codes for reliable message transmission; and a Unified Token-Replacement Interface (UTRI) that abstracts the watermark embedding process to support both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR) paradigms. We provide theoretical analysis on detection error rates and embedding capacity. Extensive experiments on three AR models demonstrate that UniMark achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction, while maintaining robustness against cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing attacks.
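
To give a feel for keyed codebook partitioning combined with block-wise bit encoding, here is a deliberately simplified toy sketch. It is not the authors' ASG or BME: the partition below is a plain keyed hash rather than a semantic grouping, token replacement picks the next codebook index in the right group rather than a semantically similar token, and a per-block majority vote stands in for error-correcting codes. All function names are hypothetical.

```python
import hashlib


def group_of(token, block_index, key):
    """Keyed pseudo-random group (0 or 1) of a codebook token within a block."""
    digest = hashlib.sha256(f"{key}:{block_index}:{token}".encode()).digest()
    return digest[0] & 1


def embed(tokens, bits, key, codebook_size):
    """Move every token in block i into the group that equals bits[i]."""
    block_len = len(tokens) // len(bits)
    if block_len == 0:
        raise ValueError("need at least one token per message bit")
    out = list(tokens)
    for pos in range(block_len * len(bits)):
        block = pos // block_len
        # Scan upward (cyclically) from the original index until a token in
        # the required group is found; offset 0 keeps the token unchanged.
        for offset in range(codebook_size):
            candidate = (tokens[pos] + offset) % codebook_size
            if group_of(candidate, block, key) == bits[block]:
                out[pos] = candidate
                break
    return out


def extract(tokens, num_bits, key):
    """Recover one bit per block by majority vote over token groups."""
    block_len = len(tokens) // num_bits
    bits = []
    for block in range(num_bits):
        chunk = tokens[block * block_len:(block + 1) * block_len]
        ones = sum(group_of(t, block, key) for t in chunk)
        bits.append(1 if 2 * ones > len(chunk) else 0)
    return bits
```

Because each block votes on a single bit, a few altered tokens per block do not change the decoded message in this toy version; without the key, the partition cannot be recomputed, which is the property the second limitation in the abstract is about.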

137. 【2509.25749】ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

Link: https://arxiv.org/abs/2509.25749

Authors: Junseo Park,Hyeryung Jang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: requiring precise garment, generate realistic images, precise garment alignment, Virtual try-on, target garment

Comments: 21 pages

Abstract:Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
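
One standard way to write the "linear inverse problem" framing is sketched below. The notation is assumed for illustration and is not taken from the paper: a binary mask m marks the non-try-on region, the measurement y is the masked source image, and the post-hoc replacement strategy corresponds to pasting the measured pixels back into the generated image.

```latex
% Assumed notation (illustrative only):
%   x         : try-on image (or latent) to be generated
%   x_src     : original person image
%   m         : binary mask, 1 on non-try-on pixels, 0 elsewhere
%   \hat{x}   : current estimate produced by the diffusion model
\begin{align}
  y &= A x, \qquad A x := m \odot x, \qquad y = m \odot x_{\mathrm{src}}
      && \text{(masked measurement)} \\
  x_{\mathrm{paste}} &= m \odot x_{\mathrm{src}} + (1 - m) \odot \hat{x}
      && \text{(post-hoc replacement)} \\
  \mathcal{L}_{\mathrm{meas}}(\hat{x}) &= \lVert y - A \hat{x} \rVert_2^2
      && \text{(measurement consistency)}
\end{align}
```

Under this reading, hard pasting enforces zero measurement error in one step at the end, while a measurement-guided sampler reduces the same error gradually along the sampling trajectory, which is consistent with the abstract's point about avoiding abrupt transitions.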

138. 【2604.12970】Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation

Link: https://arxiv.org/abs/2604.12970

Authors: Nafis Fuad Shahid,Maroof Ahmed,Md Akib Haider,Saidur Rahman Sagor,Aashnan Rahman,Md Azam Hossain

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Comments: Accepted for publication at the Medical Imaging with Deep Learning (MIDL) 2026 conference

None

139. 【2604.12778】DoseRAD2026 Challenge dataset: AI accelerated photon and proton dose calculation for radiotherapy

Link: https://arxiv.org/abs/2604.12778

Authors: Fan Xiao,Nikolaos Delopoulos,Niklas Wahl,Lennart Volz,Lina Bucher,Matteo Maspero,Miguel Palacios,Muheng Li,Samir Schulz,Viktor Rogowski,Ye Zhang,Zoltan Perko,Christopher Kurz,George Dedes,Guillaume Landry,Adrian Thummerer

Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate dose calculation, sparing healthy tissue, precise tumor irradiation, dose calculation methods, dose calculation

Comments:

Abstract:Purpose: Accurate dose calculation is essential in radiotherapy for precise tumor irradiation while sparing healthy tissue. With the growing adoption of MRI-guided and real-time adaptive radiotherapy, fast and accurate dose calculation on CT and MRI is increasingly needed. The DoseRAD2026 dataset and challenge provide a public benchmark of paired CT and MRI data with beam-level photon and proton Monte Carlo dose distributions for developing and evaluating advanced dose calculation methods. Acquisition and validation methods: The dataset comprises paired CT and MRI from 115 patients (75 training, 40 testing) treated on an MRI-linac for thoracic or abdominal lesions, derived from the SynthRAD2025 dataset. Pre-processing included deformable image registration, air-cavity correction, and resampling. Ground-truth photon (6 MV) and proton dose distributions were computed using open-source Monte Carlo algorithms, yielding 40,500 photon beams and 81,000 proton beamlets. Data format and usage notes: Data are organized into photon and proton subsets with paired CT-MRI images, beam-level dose distributions, and JSON beam configuration files. Files are provided in compressed MetaImage (.mha) format. The dataset is released under CC BY-NC 4.0, with training data available from April 2026 and the test set withheld until March 2030. Potential applications: The dataset supports benchmarking of fast dose calculation methods, including beam-level dose estimation for photon and proton therapy, MRI-based dose calculation in MRI-guided workflows, and real-time adaptive radiotherapy.

140. 【2604.12305】CBAM-Enhanced DenseNet121 for Multi-Class Chest X-Ray Classification with Grad-CAM Explainability

Link: https://arxiv.org/abs/2604.12305

Authors: Utsho Kumar Dey

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: childhood mortality worldwide, Bangladesh where radiologist, Block Attention Module, mortality worldwide, availability is limited

Comments: 10 pages, 7 figures, 2 tables. Preprint submitted to IEEE Access

Abstract:Pneumonia remains a leading cause of childhood mortality worldwide, with a heavy burden in low-resource settings such as Bangladesh where radiologist availability is limited. Most existing deep learning approaches treat pneumonia detection as a binary problem, overlooking the clinically critical distinction between bacterial and viral aetiology. This paper proposes CBAM-DenseNet121, a transfer-learning framework that integrates the Convolutional Block Attention Module (CBAM) into DenseNet121 for three-class chest X-ray classification: Normal, Bacterial Pneumonia, and Viral Pneumonia. We also conduct a systematic binary-task baseline study revealing that EfficientNetB3 (73.88%) underperforms even the custom CNN baseline (78.53%) -- a practically important negative finding for medical imaging model selection. To ensure statistical reliability, all experiments were repeated three times with independent random seeds (42, 7, 123), and results are reported as mean +/- standard deviation. CBAM-DenseNet121 achieves 84.29% +/- 1.14% test accuracy with per-class AUC scores of 0.9565 +/- 0.0010, 0.9610 +/- 0.0014, and 0.9187 +/- 0.0037 for bacterial pneumonia, normal, and viral pneumonia respectively. Grad-CAM visualizations confirm that the model attends to anatomically plausible pulmonary regions for each class, supporting interpretable deployment in resource-constrained clinical environments.

141. 【2604.11817】QMC-Net: Data-Aware Quantum Representations for Remote Sensing Image Classification

Link: https://arxiv.org/abs/2604.11817

Authors: Md Aminur Hossain,Ayush V. Patel,Biplab Banerjee

Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

Keywords: channel-specific statistical variability, multi-band remote sensing, remote sensing imagery, quantum-classical models offer, data-agnostic quantum circuits

Comments: Accepted in ICPR 2026, 15 pages

Abstract:Hybrid quantum-classical models offer a promising route for learning from complex data; however, their application to multi-band remote sensing imagery often relies on generic, data-agnostic quantum circuits that fail to account for channel-specific statistical variability. In this work, we propose a data-driven framework that maps band-level statistics such as Shannon Entropy, Variance, Spectral Flatness, and Edge Density to the hyperparameters of customized quantum circuits. Building on this framework, we introduce QMC-Net, a hybrid architecture that processes six data channels using band-specific quantum circuits, enabling adaptive quantum feature encoding and transformation across channels. Experiments on the EuroSAT and SAT-6 datasets demonstrate that QMC-Net achieves accuracies of 93.80% and 99.34%, respectively, while a residual-enhanced variant further improves performance to 94.69% and 99.39%. These results consistently outperform strong classical baselines and monolithic hybrid quantum models, highlighting the effectiveness of data-aware quantum circuit design under NISQ constraints.
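
Three of the band-level statistics named above have standard definitions, sketched below in plain Python. The abstract does not say how the statistics are mapped to circuit hyperparameters, so `depth_from_entropy` is a purely hypothetical linear mapping included only to illustrate the idea; it assumes 8-bit pixel values, for which the entropy is at most 8 bits.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of pixel values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def variance(values):
    """Population variance of the pixel values in one band."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def spectral_flatness(power_spectrum):
    """Geometric mean divided by arithmetic mean of a positive power spectrum."""
    if any(p <= 0 for p in power_spectrum):
        raise ValueError("power spectrum must be strictly positive")
    log_mean = sum(math.log(p) for p in power_spectrum) / len(power_spectrum)
    return math.exp(log_mean) / (sum(power_spectrum) / len(power_spectrum))


def depth_from_entropy(entropy_bits, max_depth=4, max_entropy=8.0):
    """Hypothetical mapping: scale entropy linearly to a circuit depth >= 1."""
    fraction = min(max(entropy_bits / max_entropy, 0.0), 1.0)
    return 1 + round(fraction * (max_depth - 1))
```

Spectral flatness is close to 1 for noise-like bands and close to 0 for bands dominated by a few frequencies, so it complements entropy and variance rather than duplicating them.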