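This post presents the daily list of the latest papers fetched from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

676 papers were updated today, including:

  • Natural language processing: 121
  • Information retrieval: 35
  • Computer vision: 134

Natural Language Processing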

1. 【2604.06170】Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Link: https://arxiv.org/abs/2604.06170

Authors: Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal

Categories: Computation and Language (cs.CL)

Keywords: synthesize relevant work, Paper Circle, efficiently discover, relevant work, rapid growth

Comments: 19 pages, 7 figures, 8 tables, ACL main (Oral)

Abstract:The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at this https URL and the code at this https URL.
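The retrieval metrics named in this abstract (hit rate, MRR, Recall at K) have standard definitions. The sketch below is a minimal illustration, assuming binary relevance and one ranked list per query; the function names and toy paper IDs are hypothetical and are not taken from Paper Circle's code.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy query with hypothetical paper IDs: the first relevant hit is at rank 3.
ranked = ["p3", "p7", "p1", "p9"]
relevant = {"p1", "p4"}
print(hit_rate_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(recall_at_k(ranked, relevant, k=3))
```

Hit rate only asks whether anything relevant was found, while Recall at K penalizes the missing second relevant paper, so the two can diverge on the same ranking.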

2. 【2604.06169】In-Place Test-Time Training

Link: https://arxiv.org/abs/2604.06169

Authors: Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: fundamentally limits Large, limits Large Language, Large Language Models, limits Large, Large Language

Comments: ICLR 2026 Oral Presentation; code is released at [this https URL](https://github.com/ByteDance-Seed/In-Place-TTT)

Abstract: The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.

3. 【2604.06156】MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Link: https://arxiv.org/abs/2604.06156

Authors: Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: capabilities remain underutilized, generative reasoning capabilities, reasoning capabilities remain, remain underutilized, successfully applied

Abstract:MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

4. 【2604.06155】Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Link: https://arxiv.org/abs/2604.06155

Authors: Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, Naipeng Chao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, world models remains, Large Language, Language Models, internal world models

Comments: ACL 2026 Main Conference

Abstract:Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.

5. 【2604.06154】Exclusive Unlearning

Link: https://arxiv.org/abs/2604.06154

Authors: Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma

Categories: Computation and Language (cs.CL)

Keywords: introducing Large Language, Large Language Models, Large Language, introducing Large, generating harmful content

Abstract:When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.

6. 【2604.06111】ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Link: https://arxiv.org/abs/2604.06111

Authors: Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: aggregate scores unreliable, Existing Agent benchmarks, make aggregate scores, high environment interaction, imbalanced task horizon

Abstract: Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that $H$ and $B$ provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.

7. 【2604.06098】JUÁ - A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Link: https://arxiv.org/abs/2604.06098

Authors: Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Portuguese remains difficult, datasets differ widely, Brazilian legal, JUÁ, query style

Abstract: Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JUÁ-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JUÁ provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.

8. 【2604.06091】Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Link: https://arxiv.org/abs/2604.06091

Authors: Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: Large language model, Large language, agent integrates diverse, diverse peer perspectives, integrates diverse peer

Comments: ACL 2026

Abstract: Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena (social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion) and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent's accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent's judgment, depending on the context. These findings reveal that multi-agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision-making.

9. 【2604.06086】LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

Link: https://arxiv.org/abs/2604.06086

Authors: Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Modern Transformer-based language, language processing tasks, natural language processing, Transformer-based language models, Modern Transformer-based

Abstract:Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introduces LAG-XAI (Lie Affine Geometry for Explainable AI), a novel geometric framework that models paraphrasing not as discrete word substitutions, but as a structured affine transformation within the embedding space. By conceptualizing paraphrasing as a continuous geometric flow on a semantic manifold, we propose a computationally efficient mean-field approximation, inspired by local Lie group actions. This allows us to decompose paraphrase transitions into geometrically interpretable components: rotation, deformation, and translation. Experiments on the noisy PIT-2015 Twitter corpus, encoded with Sentence-BERT, reveal a "linear transparency" phenomenon. The proposed affine operator achieves an AUC of 0.7713. By normalizing against random chance (AUC 0.5), the model captures approximately 80% of the non-linear baseline's effective classification capacity (AUC 0.8405), offering explicit parametric interpretability in exchange for a marginal drop in absolute accuracy. The model identifies fundamental geometric invariants, including a stable matrix reconfiguration angle (~27.84°) and near-zero deformation, indicating local isometry. Cross-domain generalization is confirmed via direct cross-corpus validation on an independent TURL dataset. Furthermore, the practical utility of LAG-XAI is demonstrated in LLM hallucination detection: using a "cheap geometric check," the model automatically detected 95.3% of factual distortions on the HaluEval dataset by registering deviations beyond the permissible semantic corridor. This approach provides a mathematically grounded, resource-efficient path toward the mechanistic interpretability of Transformers.
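The "approximately 80%" figure in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the normalization is the simple ratio of above-chance AUC, which is what "normalizing against random chance (AUC 0.5)" suggests; the function name is hypothetical.

```python
def chance_normalized_ratio(auc_model: float, auc_baseline: float,
                            auc_chance: float = 0.5) -> float:
    """Share of the baseline's above-chance AUC that the model retains."""
    return (auc_model - auc_chance) / (auc_baseline - auc_chance)


# AUC values quoted in the abstract: affine operator vs. non-linear baseline.
ratio = chance_normalized_ratio(0.7713, 0.8405)
print(f"{ratio:.3f}")  # (0.7713 - 0.5) / (0.8405 - 0.5), i.e. roughly 0.80
```

Under this reading, the raw AUC gap of about 0.07 corresponds to losing roughly a fifth of the baseline's above-chance capacity.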

10. 【2604.06071】Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

Link: https://arxiv.org/abs/2604.06071

Authors: Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: large language models, persona descriptions, natural language, large language, richly encoded

Comments: Under review at COLM

Abstract:Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants' real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.

11. 【2604.06070】Short Data, Long Context: Distilling Positional Knowledge in Transformers

Link: https://arxiv.org/abs/2604.06070

Authors: Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin, Adithya Sagar

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: posing significant challenges, typically requires expensive, Extending the context, language models typically, models typically requires

Abstract:Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher's output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.

12. 【2604.06066】From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

Link: https://arxiv.org/abs/2604.06066

Authors: Hongxu Zhou

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, open-ended reasoning tasks, reasoning tasks due, recursively justify early

Abstract: Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning tasks due to "hallucination snowballing," a phenomenon in which models recursively justify early errors during free-text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines-based constrained decoding can disrupt error propagation without additional training. Evaluating an 8-billion-parameter model (Qwen3-8B), we show that simply imposing structural constraints does not improve self-correction performance. Instead, it triggers a new failure mode termed "structure snowballing." We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an "alignment tax" inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: this https URL.

13. 【2604.06028】A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Link: https://arxiv.org/abs/2604.06028

Authors: Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein, Minu A. Aghevli, Kamonica L. Craig, Elizabeth M. Oliva, Joseph Erdos, Jodie Trafton, Ioana Danciu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large language models, unstructured health records, extracting clinically meaningful, Large language, clinically meaningful information

Abstract:Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments showed substantial agreement with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
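The agreement statistic reported in this abstract, Gwet's AC1, corrects observed agreement for chance in a way that is less sensitive to skewed category prevalence than Cohen's kappa. The sketch below covers only the two-rater, two-category case; the function name and the counts are hypothetical illustrations and are not the study's data.

```python
def gwet_ac1_binary(both_yes: int, a_yes_b_no: int,
                    a_no_b_yes: int, both_no: int) -> float:
    """Gwet's AC1 for two raters and two categories, from a 2x2 count table."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n            # observed agreement p_a
    p_a_yes = (both_yes + a_yes_b_no) / n          # rater A's "yes" rate
    p_b_yes = (both_yes + a_no_b_yes) / n          # rater B's "yes" rate
    pi = (p_a_yes + p_b_yes) / 2                   # mean "yes" prevalence
    chance = 2 * pi * (1 - pi)                     # chance agreement p_e
    return (observed - chance) / (1 - chance)


# Hypothetical counts: a judge LLM (rater A) vs. an expert (rater B), 100 cases.
print(round(gwet_ac1_binary(30, 10, 5, 55), 4))
```

With these made-up counts the raters agree on 85 of 100 cases, and the chance term removes part of that agreement before rescaling.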

14. 【2604.06022】BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

Link: https://arxiv.org/abs/2604.06022

Authors: Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava

Categories: Computation and Language (cs.CL)

Keywords: Incorrect information poses, disrupting content veracity, textual content verification, balance textual content, information poses significant

Abstract: Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experimental results on public datasets demonstrate that our BiMind model outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.
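A symmetric Kullback-Leibler term, as mentioned for the agreement regularizer, penalizes disagreement between two predictive distributions in both directions. The sketch below is a generic illustration, assuming the unweighted sum KL(p||q) + KL(q||p); the abstract does not state the exact form or weighting BiMind uses, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetric KL: zero when the two heads agree, larger as they diverge."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical class probabilities from a content head and a knowledge head.
content_head = [0.9, 0.1]
knowledge_head = [0.6, 0.4]
print(symmetric_kl(content_head, knowledge_head))
```

Unlike plain KL, the value does not depend on which head is treated as the reference, which is why a symmetric form suits an agreement penalty between two peers.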

15. 【2604.06013】Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

Link: https://arxiv.org/abs/2604.06013

Authors: Michael Cuccarese

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: multiple biological datasets, large language models, paper presents epistemic, presents epistemic blinding, paper presents

Comments: Code and LLM skill at [this https URL](https://github.com/mcuccarese/epistemic-blinding); 7 pages, 3 figures

Abstract: This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data-driven inference with memorized priors about named entities, and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model's training memory. Epistemic blinding is a simple inference-time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model's parametric knowledge. The complete target identification system is described, including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization, with a demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top-20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand-recognition bias reshapes 30-40% of top-20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open-source tool and as a Claude Code skill that enables one-command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.
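The core step described in this abstract, replacing entity identifiers with anonymous codes before prompting and then comparing against an unblinded control, can be sketched in a few lines. This is an illustration only and not the released tool: the function names, the `ENT_001` code format, and the example gene names are all assumptions.

```python
import re
from typing import Dict, List, Sequence, Tuple


def blind(text: str, entities: Sequence[str], prefix: str = "ENT") -> Tuple[str, Dict[str, str]]:
    """Replace each entity name with an anonymous code; return text and mapping."""
    if not entities:
        return text, {}
    # Longest names first, so a short name never shadows a longer one it prefixes.
    ordered = sorted(entities, key=len, reverse=True)
    mapping = {name: f"{prefix}_{i:03d}" for i, name in enumerate(ordered, start=1)}
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


def unblind(text: str, mapping: Dict[str, str]) -> str:
    """Map anonymous codes in a model output back to the original names."""
    if not mapping:
        return text
    reverse = {code: name for name, code in mapping.items()}
    pattern = re.compile("|".join(re.escape(code) for code in reverse))
    return pattern.sub(lambda m: reverse[m.group(0)], text)


def top_k_change(control: List[str], blinded: List[str]) -> float:
    """Fraction of the control top-k list that is absent from the blinded list."""
    return 1.0 - len(set(control) & set(blinded)) / len(control)


prompt, codes = blind("EGFR is upstream of KRAS; EGFR inhibitors exist.", ["EGFR", "KRAS"])
print(prompt)
print(unblind(prompt, codes))
```

A helper like `top_k_change` corresponds to the kind of comparison the abstract reports (the share of top-20 predictions that differ between blinded and unblinded runs), under the assumption that change is measured as set difference.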

16. 【2604.06005】Disentangling MLP Neuron Weights in Vocabulary Space

Link: https://arxiv.org/abs/2604.06005

Authors: Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva

Categories: Computation and Language (cs.CL)

Keywords: Rotation-Optimized Token Alignment, mechanistic interpretability, information encoded, remains a fundamental, fundamental challenge

Abstract: Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior: ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
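The statistical observation this abstract relies on, that a peaked projection onto the vocabulary has high kurtosis, can be illustrated on toy vectors. The sketch below only scores a fixed direction; it does not implement ROTATE's rotation optimization, and the identity "unembedding" matrix and function names are assumptions chosen to keep the numbers easy to follow.

```python
from typing import List, Sequence


def project_to_vocab(neuron: Sequence[float],
                     unembedding: Sequence[Sequence[float]]) -> List[float]:
    """Dot product of a neuron's weight vector with each token's unembedding row."""
    return [sum(w * e for w, e in zip(neuron, row)) for row in unembedding]


def kurtosis(xs: Sequence[float]) -> float:
    """Non-excess kurtosis: fourth central moment over squared variance."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2)


# Toy setup: 8 tokens in 8 dimensions, with an identity unembedding matrix.
vocab = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
peaked = [5.0, 0, 0, 0, 0, 0, 0, 0]       # promotes a single token
mixed = [1.0, -1.0] * 4                    # spreads evenly over all tokens

print(kurtosis(project_to_vocab(peaked, vocab)))
print(kurtosis(project_to_vocab(mixed, vocab)))
```

The single-token direction scores far higher than the evenly spread one, which is the kind of signal a kurtosis-maximizing objective would favor.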

17. 【2604.05995】The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models

Link: https://arxiv.org/abs/2604.05995

Authors: Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Comments: ACL 2026 Findings

Abstract: not available.

18. 【2604.05983】Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

Link: https://arxiv.org/abs/2604.05983

Authors: Shuqing Zhao

Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)

Keywords: Register-transfer Clocked Hardware, AI-native Register-transfer Clocked, hardware description language, AI-assisted code generation, description language designed

Abstract: We present Arch (AI-native Register-transfer Clocked Hardware), a hardware description language designed from first principles for micro-architecture specification and AI-assisted code generation. Arch introduces first-class language constructs for pipelines, FSMs, FIFOs, arbiters, register files, buses, and clock-domain crossings -- structures that existing HDLs express only as user-defined patterns prone to subtle errors. A central design choice is that clocks and resets are themselves parameterized types (Clock<D>, Reset<S,P,D?>) rather than ordinary nets, converting clock-domain crossing (CDC) and reset-domain crossing (RDC) analysis from external linter passes into compile-time typing rules. Combined with simultaneous tracking of bit widths, port directions, single-driver ownership, and combinational acyclicity, the type system catches multiple drivers, undriven ports, implicit latches, width mismatches, combinational loops, and unsynchronized domain crossings before any simulator runs. Every syntactic choice is governed by an AI-generatability contract: an LL(1) grammar requiring no backtracking or multi-token lookahead, no preprocessor or macros, a uniform declaration schema, named block endings, explicit directional connect arrows, and a todo! escape hatch enable LLMs to produce structurally correct, type-safe Arch from natural-language specifications without fine-tuning. The Arch compiler emits deterministic, lint-clean IEEE 1800-2017 SystemVerilog and provides an integrated simulation toolchain that generates compiled C++ models for cycle-accurate simulation. We present case studies of an 8-way set-associative L1 data cache and a synthesizable PG021-compatible AXI DMA controller (with Yosys and OpenSTA results on Sky130), and compare Arch to SystemVerilog, VHDL, Chisel, Bluespec, and other modern HDLs across expressiveness, safety, and AI suitability dimensions.

Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL)

ACM classes: D.3; B.5

Cite as: arXiv:2604.05983 [cs.PL] (or arXiv:2604.05983v1 [cs.PL] for this version)

DOI: https://doi.org/10.48550/arXiv.2604.05983 (arXiv-issued DOI via DataCite, pending registration)

Submission history: From Shuqing Zhao, [v1] Tue, 7 Apr 2026 15:12:14 UTC (31 KB)

19. 【2604.05971】Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

Link: https://arxiv.org/abs/2604.05971

Authors: Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: lack fine-grained understanding, contrastive vision-language models, recent model variants, research has shown, shown that contrastive

Abstract: Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts, especially those associated with off-center objects, vanish from the model's embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models' attention to off-center regions.

20. 【2604.05966】FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

Link: https://arxiv.org/abs/2604.05966

Authors: Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang, Yixi Zhou, Xunwen Zheng, Yueru He, Yuyang Dai, Georgi Georgiev, Ayesha Gull, Muhammad Usman Safder, Fan Wu, Liyuan Meng, Fengxian Ji, Junning Zhao, Xueqing Peng, Jimin Huang, Yu Chen, Xue (Steve) Liu, Preslav Nakov, Zhuohan Xie

Categories: Computation and Language (cs.CL)

Keywords: large language models, summarize corporate disclosures, language models, corporate disclosures, increasingly use large

Comments: 9 pages, including figures and tables

Abstract: Financial reporting systems increasingly use large language models (LLMs) to extract and summarize corporate disclosures. However, most assume a single-market setting and do not address structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions make cross-jurisdiction reporting a semantic alignment and verification challenge. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system builds a unified canonical ontology over Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than using LLMs as free-form generators, FinReporting deploys them as constrained verifiers under explicit decision rules and evidence grounding. Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. We release an interactive demo supporting cross-market inspection and structured export of localized financial statements. Our demo is available at this https URL. The video describing our system is available at this https URL.

21. 【2604.05952】Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

Link: https://arxiv.org/abs/2604.05952

Authors: Yi Yuan, Xuhong Wang, Shanzhe Lei

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: automatically generating research-style, generating research-style reports, agent-based systems continue, continue to evolve, diverse domains

Comments: 20 pages, 3 tables, 2 figures

Abstract: As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains. While these agents promise to streamline information synthesis and knowledge exploration, existing evaluation frameworks, typically based on subjective dimensions, fail to capture a critical aspect of report quality: trustworthiness. In open-ended research scenarios where ground-truth answers are unavailable, current evaluation methods cannot effectively measure the epistemic confidence of generated content, making calibration difficult and leaving users susceptible to misleading or hallucinated information. To address this limitation, we propose a novel deep research agent that incorporates progressive confidence estimation and calibration within the report generation pipeline. Our system leverages a deliberative search model, featuring deep retrieval and multi-hop reasoning to ground outputs in verifiable evidence while assigning confidence scores to individual claims. Combined with a carefully designed workflow, this approach produces trustworthy reports with enhanced transparency. Experimental results and case studies demonstrate that our method substantially improves interpretability and significantly increases user trust.

22. 【2604.05942】BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

Link: https://arxiv.org/abs/2604.05942

Authors: Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui

Categories: Computation and Language (cs.CL)

Keywords: replaces quadratic self-attention, large language models, Post-training hybridization, language models, sliding-window attention

Comments: ACL 2026 (Main Conference)

Abstract: Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head's local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (iii) grouped head-level optimization within ratio buckets. Extensive experiments on 4 LLMs ranging from 1.7B to 30B parameters, across 4 SWA ratios, show that BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recovers original long-context performance faster and to a higher level. Analysis of the selected heads reveals substantial turnover for BOSCH across different SWA ratios, underscoring the importance of performing head-level selection for each target ratio rather than relying on fixed locality rankings.
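
To make the head-level hybridization setting concrete, the following is a minimal sketch of per-head attention masks, where a binary vector decides which heads use sliding-window attention and which keep full causal attention. It is not BOSCH itself: the function names, the window size, and the format of the selection vector are assumptions made for illustration, and the search procedure that chooses the vector is not shown.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j.
    window=None gives full causal attention; an integer gives sliding-window
    attention in which each query sees at most `window` recent keys,
    itself included.
    """
    mask = []
    for i in range(seq_len):
        lo = 0 if window is None else max(0, i - window + 1)
        mask.append([lo <= j <= i for j in range(seq_len)])
    return mask


def per_head_masks(seq_len, use_swa, window):
    """Build one mask per head from a binary selection vector.

    use_swa[h] == True means head h is converted to sliding-window attention
    (hypothetical encoding, chosen for this sketch).
    """
    return [causal_mask(seq_len, window if swa else None) for swa in use_swa]


def attended_pairs(mask):
    """Count (query, key) pairs a head attends to, a rough proxy for its cost."""
    return sum(sum(row) for row in mask)


# Two heads over 6 tokens: head 0 uses a window of 2, head 1 stays global.
masks = per_head_masks(6, [True, False], window=2)
print(attended_pairs(masks[0]), attended_pairs(masks[1]))
```

With a fixed window, the number of attended pairs for a sliding-window head grows linearly in sequence length instead of quadratically, which is the source of the savings the abstract refers to.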

23. 【2604.05930】"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Link: https://arxiv.org/abs/2604.05930

Authors: Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du, Jun Wang, Zhihui Fu, Jinbao Li, Shouling Ji

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: common form, form of rhetorical, rhetorical wordplay, wordplay that exploits, exploits polysemy

Comments: ACL 2026 Main

Abstract:Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.

24. 【2604.05923】The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

Link: https://arxiv.org/abs/2604.05923

Authors: Hongxu Zhou

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: hierarchical structures Sarrof, structures Sarrof, star-free sequential tasks, bounded hierarchical structures, shown to possess

Abstract: State space models (SSMs) have been shown to possess the theoretical capacity to model both star-free sequential tasks and bounded hierarchical structures (Sarrof et al., 2024). However, formal expressivity results do not guarantee that gradient-based optimisation will reliably discover the corresponding solutions. Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval. We introduce the UNDO Flip-Flop task to fill this gap. By extending the standard Flip-Flop with an UNDO operation, the task requires a model to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences. We evaluate one-layer and two-layer Mamba-2 under this framework. Both variants fail to acquire the provably expressible stack-based rollback mechanism, converging instead on a local toggle heuristic that inverts the current state rather than retrieving stored history. Under an adversarial retraction pressure test held within the training length distribution, the two-layer model collapses to 41.10% accuracy, which is below random chance. The results confirm systematic rather than incidental failure. Causal ablation shows that the bottleneck lies in retrieval, not storage. These results draw a clear line between what an architecture can in principle represent and what gradient descent reliably learns, a distinction that theoretical expressivity analyses alone cannot capture.
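
The difference between stack-based rollback and the local toggle heuristic can be illustrated with a small sketch. The instruction vocabulary (`write`, `undo`, `read`), the single-bit state, and the choice to treat an UNDO on a one-element history as a no-op are assumptions made here for illustration; the paper's exact task format may differ.

```python
def stack_rollback(ops):
    """Reference behaviour: UNDO restores the previously written value.

    Assumes every sequence starts with a write, so `history` is never empty
    when a read occurs.
    """
    history = []
    outputs = []
    for op, *arg in ops:
        if op == "write":
            history.append(arg[0])
        elif op == "undo":
            if len(history) > 1:  # keep the oldest value (assumed convention)
                history.pop()
        elif op == "read":
            outputs.append(history[-1])
    return outputs


def toggle_heuristic(ops):
    """Shortcut behaviour: UNDO simply inverts the current bit."""
    state = None
    outputs = []
    for op, *arg in ops:
        if op == "write":
            state = arg[0]
        elif op == "undo":
            state = 1 - state
        elif op == "read":
            outputs.append(state)
    return outputs


# The two agree when consecutive writes differ...
easy = [("write", 0), ("write", 1), ("undo",), ("read",)]
# ...and disagree when the same value is written twice before an UNDO.
hard = [("write", 1), ("write", 1), ("undo",), ("read",)]
print(stack_rollback(easy), toggle_heuristic(easy))
print(stack_rollback(hard), toggle_heuristic(hard))
```

Sequences like `hard` are the kind of input on which a toggle shortcut is systematically wrong, which is one way a model relying on it could fall below chance on an adversarially constructed test set.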

25. 【2604.05912】FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Link: https://arxiv.org/abs/2604.05912

Authors: Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi, Muhammad Ahsen Fahim, Hanzallah Amjad, Ahmad Orakzai, Aqsa Gul, Chris Tanner

Categories: Computation and Language (cs.CL)

Keywords: concerns surrounding AI-driven, existing benchmarks fail, practical professional expertise, AI-driven labor displacement, labor displacement intensify

Abstract:As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average, and are more likely to provide client-ready outputs than current state-of-the-art systems.

26. 【2604.05899】FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents

Link: https://arxiv.org/abs/2604.05899

Authors: Cherifa Ben Khelil, Jean-Yves Antoine, Anaïs Halftermeyer, Frédéric Rayar, Mathieu Thebaud

Categories: Computation and Language (cs.CL)

Keywords: linguistic resource specifically, resource specifically tailored, linguistic resource, resource specifically, specifically tailored

Comments: 5 pages, 1 figure

Abstract: In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and the commitment to providing open online accessibility for all. Such a corpus can serve as the foundation for training language models that understand and anticipate youth's language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users of this age.

27. 【2604.05876】Mechanistic Circuit-Based Knowledge Editing in Large Language Models

Link: https://arxiv.org/abs/2604.05876

Authors: Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen

Categories: Computation and Language (cs.CL)

Keywords: Deploying Large Language, Large Language Models, Deploying Large, Large Language, real-world dynamic environments

Abstract: Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a "Reasoning Gap", where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (Mechanistic Circuit-based Knowledge Editing), a novel framework that enables a precise "map-and-adapt" editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically updates parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.

28. 【2604.05872】Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

Link: https://arxiv.org/abs/2604.05872

Authors: Fatih Uenal

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Swiss-focused evaluation frameworks, existing Swiss-focused evaluation, regulatory contexts demands, contexts demands empirical, demands empirical evidence

Comments: 23 pages, 5 figures, 8 tables

Abstract:The deployment of large language models (LLMs) in Swiss financial and regulatory contexts demands empirical evidence of both production reliability and adversarial security, dimensions not jointly operationalized in existing Swiss-focused evaluation frameworks. This paper introduces Swiss-Bench 003 (SBP-003), extending the HAAS (Helvetic AI Assessment Score) from six to eight dimensions by adding D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security). I evaluate ten frontier models across 808 Swiss-specific items in four languages (German, French, Italian, English), comprising seven Swiss-adapted benchmarks (Swiss TruthfulQA, Swiss IFEval, Swiss SimpleQA, Swiss NIAH, Swiss PII-Scope, System Prompt Leakage, and Swiss German Comprehension) targeting FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and OWASP Top 10 for LLMs. Self-graded D7 scores (73-94%) exceed externally judged D8 security scores (20-61%) by a wide margin, though these dimensions use non-comparable scoring regimes. System prompt leakage resistance ranges from 24.8% to 88.2%, while PII extraction defense remains weak (14-42%) across all models. Qwen 3.5 Plus achieves the highest self-graded D7 score (94.4%), while GPT-oss 120B achieves the highest D8 score (60.7%) despite being the lowest-cost model evaluated. All evaluations are zero-shot under provider default settings; D7 is self-graded and does not constitute independently validated accuracy. I provide conceptual mapping tables relating benchmark dimensions to FINMA model validation requirements, nDSG data protection obligations, and OWASP LLM risk categories.

29. 【2604.05868】Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

Link: https://arxiv.org/abs/2604.05868

Authors: Xiangming Gu, Soham De, Larisa Markeeva, Petar Veličković, Razvan Pascanu

Categories: Computation and Language (cs.CL)

Keywords: Large Reasoning Models, Large Reasoning, shown remarkable performance, Reasoning Models, shown remarkable

Comments: Under review

Abstract: Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high-quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, in line with previous work, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we put forward three hypotheses: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that aggregation and context length do not seem to be the main culprits behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause of the performance gap.
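
The two sampling strategies can be sketched as follows. This is only an illustration under stated assumptions: `sampler` is a hypothetical callable standing in for a model call, majority voting is assumed as the aggregator for parallel sampling, and the toy sampler is deliberately built to repeat its previous answer so that the reduced-exploration effect is visible. None of this is taken from the paper's code.

```python
from collections import Counter


def parallel_sample(sampler, question, n):
    """Draw n independent answers and aggregate them by majority vote."""
    answers = [sampler(question, []) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def sequential_sample(sampler, question, n):
    """Draw n answers one after another, each conditioned on the earlier ones."""
    previous = []
    for _ in range(n):
        previous.append(sampler(question, list(previous)))
    return previous[-1]


def make_toy_sampler(draws):
    """Toy stand-in for a model (hypothetical).

    Without context it returns the next value from `draws`; with context it
    just repeats the most recent earlier answer, i.e. it never explores.
    """
    it = iter(draws)

    def sampler(question, previous):
        if previous:
            return previous[-1]
        return next(it)

    return sampler


draws = ["3", "4", "4"]
print(parallel_sample(make_toy_sampler(draws), "2 + 2 = ?", 3))
print(sequential_sample(make_toy_sampler(draws), "2 + 2 = ?", 3))
```

In this toy setting the sequential chain stays anchored on its first answer, while the independent draws give the vote a chance to recover; a real model's behaviour is of course far less extreme.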

30. 【2604.05866】Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching

Link: https://arxiv.org/abs/2604.05866

Authors: Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: accurately recommending suitable, recommending suitable reviewers, conference submission volumes, submission volumes continue, continue to grow

Comments: Accepted by IJCNN-2026

Abstract: As conference submission volumes continue to grow, accurately recommending suitable reviewers has become a challenge. Most existing methods follow a "Paper-to-Paper" matching paradigm, implicitly representing a reviewer by their publication history. However, effective reviewer matching requires capturing multi-dimensional expertise, and textual similarity to past papers alone is often insufficient. To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching. P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications. Building on these profiles, P2R adopts a coarse-to-fine pipeline to balance efficiency and depth. It first performs hybrid retrieval that combines semantic and aspect-level signals to form a high-recall candidate pool, and then applies an LLM-based committee to evaluate candidates under strict rubrics, integrating both multi-dimensional expert views and a holistic Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines. Ablation studies further verify the necessity of each component. Overall, P2R highlights the value of explicit, structured expertise modeling and offers practical guidance for applying LLMs to reviewer matching.

31. 【2604.05863】LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

Link: https://arxiv.org/abs/2604.05863

Authors: Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu, Yunpeng Zhu, Ligang He

Categories: Computation and Language (cs.CL)

Keywords: Rotating Machinery, rotating-machinery signal understanding, self-supervised framework, multi-modal rotating-machinery signal, Machinery

Abstract:We present LoRM (Language of Rotating Machinery), a self-supervised framework for multi-modal rotating-machinery signal understanding and real-time condition monitoring. LoRM is built on the idea that rotating-machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi-sensor context. Unlike conventional signal-processing methods that rely on hand-crafted transforms and features, LoRM reformulates multi-modal sensor data as a token-based sequence-prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine-tuning a general-purpose pre-trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token-prediction errors as a health indicator, where increasing errors indicate degradation. In-situ tool condition monitoring (TCM) experiments demonstrate stable real-time tracking and strong cross-tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at this https URL.

32. 【2604.05848】Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

Link: https://arxiv.org/abs/2604.05848

Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: preserve meaningful differences, educational AI systems, highly context-dependent, Learner representations play, play a central

Comments: Accepted to AIED 2026

Abstract:Learner representations play a central role in educational AI systems, yet it is often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context-dependent. This work examines how to evaluate learner representations based on whether they retain separation between learners under a shared comparison rule. We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation. Using student-authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student's interactions over time. Results show that learner-level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. These findings demonstrate that learner representations can be evaluated independently of instructional outcomes and provide a practical pre-deployment criterion using distinctiveness as a diagnostic metric for assessing whether a representation supports differentiated modeling or personalization.
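
As a rough illustration of a pairwise-distance measure of this kind, the sketch below scores each learner by the mean Euclidean distance from their representation to every other learner's. The abstract does not give the exact formula, so the distance function, the averaging, and the name `distinctiveness` as used here are assumptions for illustration only.

```python
import math


def distinctiveness(vectors):
    """Mean Euclidean distance from each learner's vector to all the others.

    Assumed definition for this sketch; needs at least two learners.
    """
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two learners to compare")
    scores = []
    for i, v in enumerate(vectors):
        total = sum(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        scores.append(total / (n - 1))
    return scores


# Two identical learners and one that stands apart from them.
cohort = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
print(distinctiveness(cohort))
```

A score computed this way needs no labels or clustering step, which matches the motivation stated in the abstract; a representation that maps every learner to nearly the same point would yield uniformly low scores.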

33. 【2604.05846】AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Link: https://arxiv.org/abs/2604.05846

Authors: Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, agentic capabilities-iterative retrieval, parametric knowledge

Comments: ACL 2026 Main Conference

Abstract: Large Language Models (LLMs) increasingly rely on agentic capabilities (iterative retrieval, tool use, and decision-making) to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at this https URL.

34. 【2604.05830】"OK Aura, Be Fair With Me": Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection

Link: https://arxiv.org/abs/2604.05830

Authors: Fernando López, Paula Delgado-Santos, Pablo Gómez, David Solans, Jordi Luque

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: critical challenge due, achieving fair Wake-up, Voice-based interfaces, diverse speaker populations, speaker populations remains

Comments: Accepted at Speech Language Models in Low-Resource Settings: Performance, Evaluation, and Bias Analysis (SPEAKABLE) - LREC2026 Workshops

Abstract: Voice-based interfaces are widely used; however, achieving fair Wake-up Word detection across diverse speaker populations remains a critical challenge due to persistent demographic biases. This study evaluates the effectiveness of demographics-agnostic training techniques in mitigating performance disparities among speakers of varying sex, age, and accent. We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes. We explore (i) data augmentation techniques to enhance model generalization and (ii) knowledge distillation of pre-trained foundational speech models. The experimental results indicate that these demographics-agnostic training techniques markedly reduce demographic bias, leading to a more equitable performance profile across different speaker groups. Specifically, one of the evaluated techniques achieves a Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent when compared to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.

35. 【2604.05821】CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training

Link: https://arxiv.org/abs/2604.05821

Authors: Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Existing multilingual embedding, imbalanced linguistic resources, multilingual embedding models, Existing multilingual, embedding models

Comments: ACL2026 Main

Abstract:Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and less consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at this https URL.

36. 【2604.05818】WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Link: https://arxiv.org/abs/2604.05818

Authors: Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, Question Answering, Visual Question, highly effective paradigm

Comments: Accepted by ACL 2026 Findings

Abstract:Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on this https URL.

37. 【2604.05795】Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

Link: https://arxiv.org/abs/2604.05795

Authors: Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar

Subjects: Computation and Language (cs.CL)

Keywords: large language models, health applications calls, surface-level fluency, large language, language models

Comments: Accepted at ACL 2026 (Main)

Abstract:The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapist's utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness, using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F-1 score of 63.34 versus 38.56 for the strong baseline Qwen3, which also serves as its backbone. This is a 64.26% relative improvement, indicating that gains arise from structured reasoning and contextual modeling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.
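
The 64.26 figure in the abstract above is consistent with a relative F-1 improvement expressed in percent. A quick check, using only the two scores quoted in the abstract (the helper name `relative_improvement` is ours, not the paper's):

```python
# F-1 scores quoted in the abstract above.
care_f1 = 63.34
qwen3_f1 = 38.56


def relative_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gain = relative_improvement(care_f1, qwen3_f1)
print(f"{gain:.2f}%")  # 64.26%
```

The absolute gap is 24.78 F-1 points; dividing by the 38.56 baseline gives the quoted relative gain.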

38. 【2604.05779】What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say "I Don't Know"

Link: https://arxiv.org/abs/2604.05779

Authors: Joosung Lee, Hwiyeol Jo, Donghyeon Ko, Kyubyung Chae, Cheonbok Park, Jeonghoon Kim

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: demonstrate strong capabilities, diverse user queries, large language models, demonstrate strong, suffer from hallucinations

Comments: 8 pages

Abstract:While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal according to the model's existing knowledge, while encouraging explicit "I don't know" responses for out-of-scope queries. Experimental results show that this approach allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer. Furthermore, we propose evaluation metrics for uncertainty, showing that accurate discrimination between known and unknown instances consistently improves performance.
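
One way to picture an instance-level knowledge score obtained from multi-sampled inference is as the fraction of sampled answers that agree with the reference. The sketch below is an illustration under our own assumptions, not the authors' implementation: the function names, the exact-match criterion, the `idk_threshold` rule, and the use of the score as a loss weight are all hypothetical.

```python
from typing import List, Tuple

IDK = "I don't know"


def knowledge_score(samples: List[str], reference: str) -> float:
    """Fraction of sampled answers that exactly match the reference (assumed criterion)."""
    if not samples:
        return 0.0
    ref = reference.strip().lower()
    hits = sum(s.strip().lower() == ref for s in samples)
    return hits / len(samples)


def build_target(samples: List[str], reference: str,
                 idk_threshold: float = 0.0) -> Tuple[str, float]:
    """Return (training target, loss weight) for one instance.

    Hypothetical rule: at or below `idk_threshold` the model is trained toward
    an explicit "I don't know"; otherwise the reference answer is kept and the
    learning signal is scaled by the knowledge score.
    """
    score = knowledge_score(samples, reference)
    if score <= idk_threshold:
        return IDK, 1.0
    return reference, score


# Toy usage with made-up sampled answers.
known = build_target(["Paris", "paris", "Lyon", "Paris"], "Paris")
unknown = build_target(["Rome", "Milan"], "Turin")
print(known)    # ('Paris', 0.75)
print(unknown)  # ("I don't know", 1.0)
```

In practice the matching criterion and the mapping from score to weight would need to follow the paper; the point here is only that the score is computed per instance from repeated sampling.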

39. 【2604.05775】PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

Link: https://arxiv.org/abs/2604.05775

Authors: Yusen Hou, Weicai Long, Haitao Hu, Houcheng Su, Junning Feng, Yanlin Zhang

Subjects: Computation and Language (cs.CL); Genomics (q-bio.GN)

Keywords: regulating microbial ecosystems, play a critical, antibiotic alternatives, dark matter, critical role

Comments: none

Abstract:Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.

40. 【2604.05767】Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

Link: https://arxiv.org/abs/2604.05767

Authors: Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: production ADAS systems, ADAS systems, large-scale ego-centric dashcam, production ADAS

Comments: none

Abstract:We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar's Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.

Submission history: [v1] Tue, 7 Apr 2026 12:10:21 UTC (2,554 KB), from Roni Goldshmidt

41. 【2604.05757】Identifying Influential N-grams in Confidence Calibration via Regression Analysis

Link: https://arxiv.org/abs/2604.05757

Authors: Shintaro Ozaki, Wataru Hashimoto, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Subjects: Computation and Language (cs.CL)

Keywords: expressions demonstrating uncertainty, large language models, demonstrating uncertainty, large language, include linguistic expressions

Comments: none

Abstract:While large language models (LLMs) improve performance through explicit reasoning, their responses are often overconfident, even though they include linguistic expressions demonstrating uncertainty. In this work, we identify which linguistic expressions are related to confidence by applying a regression method. Specifically, we treat confidence as the dependent variable, regress it on the linguistic expressions that appear in the reasoning parts of LLM outputs, and analyze the relationship between a specific $n$-gram and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted in test-time scaling to improve reasoning performance. Through a causality test verifying that the extracted linguistic information truly affects confidence, we reveal that confidence calibration is possible by simply suppressing those overconfident expressions without drops in performance.

42. 【2604.05756】Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

Link: https://arxiv.org/abs/2604.05756

Authors: Yanbei Jiang, Amr Keleg, Ryandito Diandaru, Jey Han Lau, Lea Frermann, Biaoyan Fang, Fajri Koto

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, fixed ground truths, inherently stochastic

Comments: Accepted at ACL Main Conference

Abstract:While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g. reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback-Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.
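
To make the notion of distribution alignment concrete, the sketch below measures how far the empirical distribution of repeated generations is from a target distribution using Kullback-Leibler divergence. It is a minimal illustration, not the paper's training objective: the sentiment labels, the ten sample outputs, the function names, and the choice of direction KL(target || observed) are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def empirical_distribution(outputs: List[str], labels: List[str]) -> Dict[str, float]:
    """Share of each label among the outputs of repeated prompting."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(outputs)
    return {label: counts[label] / len(outputs) for label in labels}


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; labels with p == 0 contribute nothing."""
    return sum(px * math.log(px / max(q[x], eps)) for x, px in p.items() if px > 0)


labels = ["positive", "negative"]
target = {"positive": 0.5, "negative": 0.5}         # desired uniform distribution
outputs = ["positive"] * 8 + ["negative"] * 2       # ten hypothetical generations
observed = empirical_distribution(outputs, labels)  # 80% positive, 20% negative
divergence = kl_divergence(target, observed)
print(round(divergence, 4))  # 0.2231
```

A divergence of zero would mean the repeated generations match the target exactly; larger values indicate a stronger skew toward some attribute values.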

43. 【2604.05738】MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

Link: https://arxiv.org/abs/2604.05738

Authors: Han Jang, Junhyeok Lee, Heeseong Eum, Kyu Sung Choi

Subjects: Computation and Language (cs.CL)

Keywords: interpreting diagnostic imaging, achieved expert-level proficiency, Medical Vision-Language Models, diagnostic imaging, achieved expert-level

Comments: Accepted at ACL 2026 Findings (Oral). 9 pages, 5 figures, 11 tables, plus appendix

Abstract:Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

44. 【2604.05711】SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

Link: https://arxiv.org/abs/2604.05711

Authors: Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: applications rely heavily, connect disparate information, Web applications rely, disparate information resources, applications rely

Comments: Accepted at the 19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon, Republic of Korea

Abstract:Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously, semantic drift, where a valid HTTP 200 connection exists, but the target content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect semantic inconsistencies, thereby compromising web integrity and user experience. While Large Language Models (LLMs) offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink's source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.
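
The core idea of a semantic oracle, as opposed to a status-code check, can be sketched as a similarity test between two embeddings. In the paper the embeddings come from an SBERT-backed Siamese network; the sketch below substitutes small hand-written vectors so that it runs without any model, and the function names and the 0.5 threshold are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_is_coherent(source_vec: Sequence[float], target_vec: Sequence[float],
                     threshold: float = 0.5) -> bool:
    """Pass the link if source context and target page are similar enough.

    The threshold is an assumed value for illustration only.
    """
    return cosine_similarity(source_vec, target_vec) >= threshold


# Toy stand-ins for sentence embeddings.
source_context = [1.0, 0.0, 1.0]
on_topic_page = [0.9, 0.1, 0.8]
drifted_page = [0.0, 1.0, 0.0]

print(link_is_coherent(source_context, on_topic_page))  # True
print(link_is_coherent(source_context, drifted_page))   # False
```

Both target pages could return HTTP 200; only the similarity check separates the on-topic page from the drifted one.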

45. 【2604.05702】Dialogue Act Patterns in GenAI-Mediated L2 Oral Practice: A Sequential Analysis of Learner-Chatbot Interactions

Link: https://arxiv.org/abs/2604.05702

Authors: Liqun He, Shijun (Cindy) Chen, Mutlu Cukurova, Manolis Mavrikis

Subjects: Computation and Language (cs.CL)

Keywords: gains remain underexplored, offer scalable opportunities, interactional processes related, learners' gains remain, chatbots offer scalable

Comments: Accepted for publication as a full paper (Main Track) at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract:While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners' gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.

46. 【2604.05688】Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Link: https://arxiv.org/abs/2604.05688

Authors: Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jin-Long Li

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: bandwidth increasingly dominate, model inference cost, increasingly dominate large, language model inference, dominate large language

Comments: none

Abstract:Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bound, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both source and target attention modules, which are often infeasible to satisfy in practical deployment. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) to new attention architectures without re-pretraining from scratch. Attention Editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different targets, MLA and GateSWA (a gated hybrid SWA design), and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, experiments are conducted on an Ascend 910B cluster, offering a practical training case study on domestic hardware.

47. 【2604.05681】LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Link: https://arxiv.org/abs/2604.05681

Authors: Ojas Jain, Dhruv Kumar

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: meaningful planning complexity, home-path progression introduce, progression introduce meaningful, introduce meaningful planning, safe-square navigation

Comments: Under Review

Abstract:We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game-theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries), and model outputs are available at this https URL.
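
For readers unfamiliar with Expectiminimax, the sketch below shows depth-limited search over a tiny hand-built tree with max, min, and chance nodes. It is a generic textbook-style illustration, not the LudoBench agent: the `Node` structure, the toy payoffs, and the use of `value` as a static estimate when the depth limit is reached are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # "max", "min", "chance", or "leaf"
    value: float = 0.0             # payoff for leaves; static estimate otherwise
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)  # only used by chance nodes


def expectiminimax(node: Node, depth: int) -> float:
    """Depth-limited expectiminimax value of `node`."""
    if node.kind == "leaf" or depth == 0:
        return node.value          # terminal payoff or heuristic cut-off
    scores = [expectiminimax(child, depth - 1) for child in node.children]
    if node.kind == "max":
        return max(scores)
    if node.kind == "min":
        return min(scores)
    if node.kind == "chance":
        # Expected value over random outcomes such as dice rolls.
        return sum(p * s for p, s in zip(node.probs, scores))
    raise ValueError(f"unknown node kind: {node.kind}")


# Two candidate moves, each followed by a random event.
move_a = Node("chance", children=[Node("leaf", 10.0), Node("leaf", -4.0)],
              probs=[0.5, 0.5])    # expected value 3.0
move_b = Node("chance", children=[Node("leaf", 12.0), Node("leaf", 0.0)],
              probs=[1 / 6, 5 / 6])  # expected value 2.0
root = Node("max", children=[move_a, move_b])

best = expectiminimax(root, depth=2)
print(best)  # 3.0
```

A greedy player attracted by the single best outcome (12) would pick the second move; averaging over the chance outcomes shows the first move is better in expectation.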

48. 【2604.05655】LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

Link: https://arxiv.org/abs/2604.05655

Authors: Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models', work characterizes large, characterizes large language, language models', representation space

Comments: ACL 2026 (Main)

Abstract:This work characterizes large language models' chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.

49. 【2604.05650】See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

Link: https://arxiv.org/abs/2604.05650

Authors: Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, Huan Li

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, Video Large

Comments: ACL 2026 Main Conference

Abstract:Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves 99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.

50. 【2604.05643】Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

Link: https://arxiv.org/abs/2604.05643

Authors: Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue, Mengke Chen, Qiutong Pan, Haiwei Wang, Haifeng Li

Subjects: Computation and Language (cs.CL)

Keywords: capabilities of LLMs, Extending CoT, Extending, reasoning capabilities, reflection

Comments: none

Abstract:Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.
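
The idea of turning a linear chain of thought into a dependency graph and pruning it can be sketched with a very simple rule: keep only the steps that the final answer transitively depends on. This is a simplified stand-in for the paper's branch-level and depth-level pruning, whose actual scoring criteria are not given in the abstract; the step names, the dependency map, and both function names are hypothetical.

```python
from typing import Dict, List, Set


def ancestors_of(target: str, deps: Dict[str, List[str]]) -> Set[str]:
    """All steps that `target` transitively depends on, including itself."""
    keep: Set[str] = set()
    stack = [target]
    while stack:
        step = stack.pop()
        if step in keep:
            continue
        keep.add(step)
        stack.extend(deps.get(step, []))
    return keep


def prune_unused_steps(order: List[str], deps: Dict[str, List[str]],
                       final: str) -> List[str]:
    """Drop steps the final answer never uses, preserving the original order."""
    keep = ancestors_of(final, deps)
    return [step for step in order if step in keep]


# A toy reasoning trace: two reflection steps feed nothing downstream.
order = ["s1", "s2", "check_s1", "s3", "recheck_s3", "answer"]
deps = {
    "s2": ["s1"],
    "check_s1": ["s1"],      # reflection branch with no consumer
    "s3": ["s2"],
    "recheck_s3": ["s3"],    # late re-verification of a settled result
    "answer": ["s3"],
}

pruned = prune_unused_steps(order, deps, final="answer")
print(pruned)  # ['s1', 's2', 's3', 'answer']
```

A real system would need a softer notion of contribution than pure reachability, since some reflection steps do change the outcome; this sketch only shows the graph mechanics.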

51. 【2604.05624】YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

Link: https://arxiv.org/abs/2604.05624

Authors: Peace Busola Falola, Jesujoba O. Alabi, Solomon O. Akinola, Folashade T. Ogunajo, Emmanuel Oluwadunsin Alabi, David Ifeoluwa Adelani

Subjects: Computation and Language (cs.CL)

Keywords: Named Entity Recognition, foundational NLP task, NLP task, foundational NLP, Named Entity

Comments: LREC 2026

Abstract:Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.

52. 【2604.05623】DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Link: https://arxiv.org/abs/2604.05623

Authors: Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: ensuring high reliability, Accurately detecting, Multimodal Large Language, Large Language Models, detecting and localizing

Comments: 8 pages, 5 figures. The dataset and code are available at [this https URL](https://zyx-hhnkh.github.io/DetailVerifyBench/)

Abstract:Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at this https URL.

53. 【2604.05605】INTERACT: An AI-Driven Extended Reality Framework for Accesible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition

Link: https://arxiv.org/abs/2604.05605

Authors: Nikolaos D. Tantaroudas, Andrew J. McCracken, Ilias Karachalios, Evangelos Papatheou

Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: offer limited support, World Health Organisation, Video conferencing, platforms offer limited, Health Organisation estimates

Comments: 20

Abstract:Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organisation estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain constrained by high costs, limited availability, and logistical barriers, while Extended Reality (XR) technologies open new possibilities for immersive and inclusive communication. This paper presents INTERACT (Inclusive Networking for Translation and Embodied Real-Time Augmented Communication Tool), an AI-driven XR platform that integrates real-time speech-to-text conversion, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition within an immersive virtual environment. Built on the CORTEX2 framework and deployed on Meta Quest 3 headsets, INTERACT combines Whisper for speech recognition, NLLB for multilingual translation, RoBERTa for emotion classification, and Google MediaPipe for gesture extraction. Pilot evaluations were conducted in two phases, first with technical experts from academia and industry, and subsequently with members of the deaf community. The trials reported 92% user satisfaction, transcription accuracy above 85%, and 90% emotion-detection precision, with a mean overall experience rating of 4.6 out of 5.0 and 90% of participants willing to take part in further testing. The results highlight strong potential for advancing accessibility across educational, cultural, and professional settings. An extended version of this work, including full pilot data and implementation details, has been published as an Open Research Europe article [Tantaroudas et al., 2026a].

54. 【2604.05593】Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

Link: https://arxiv.org/abs/2604.05593

Authors: Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali, Saku Sugawara

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, automated evaluators, Large, labels

Comments: none

Abstract: Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for judgments. We analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than the content region, and this label dominance is stronger under Human labels than AI labels, consistent with the human gaze patterns. In addition, decision uncertainty measured by logits is higher under AI labels than Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. This raises validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously suggest that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.

55. 【2604.05591】AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering

Link: https://arxiv.org/abs/2604.05591

Authors: N.D. Tantaroudas, A.J. McCracken, I. Karachalios, E. Papatheou

Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)

Keywords: International Sign, rendering through Google, Google MediaPipe, OpenAI Whisper, automatic speech recognition

Comments: 21

Abstract: This work introduces a modular platform that brings together six AI services: automatic speech recognition via OpenAI Whisper, multilingual translation through Meta NLLB, speech synthesis using AWS Polly, emotion classification with RoBERTa, dialogue summarisation via flan t5 base samsum, and International Sign (IS) rendering through Google MediaPipe. A corpus of IS gesture recordings was processed to derive hand landmark coordinates, which were subsequently mapped onto three-dimensional avatar animations inside a virtual reality (VR) environment. Validation comprised technical benchmarking of each AI component, including comparative assessments of speech synthesis providers and multilingual translation models (NLLB 200 and EuroLLM 1.7B variants). Technical evaluations confirmed the suitability of the platform for real-time XR deployment. Speech synthesis benchmarking established that AWS Polly delivers the lowest latency at a competitive price point. The EuroLLM 1.7B Instruct variant attained a higher BLEU score, surpassing NLLB. These findings establish the viability of orchestrating cross-modal AI services within XR settings for accessible, multilingual language instruction. The modular design permits independent scaling and adaptation to varied educational contexts, providing a foundation for equitable learning solutions aligned with European Union digital accessibility goals.

56. 【2604.05564】THIVLVC: Retrieval Augmented Dependency Parsing for Latin

Link: https://arxiv.org/abs/2604.05564

Authors: Luc Pommeret (STL), Thibault Wagret (ENS de Lyon, HiSoMA), Jules Deret

Categories: Computation and Language (cs.CL)

Keywords: Dependency Parsing task, Dependency Parsing, Parsing task, POS n-gram similarity, describe THIVLVC

Comments: none

Abstract:We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.
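
The retrieval stage described in this abstract (structural similarity from sentence length and POS n-grams) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the Jaccard-style overlap, the `length_weight` blend, the function names, and the toy treebank are all assumptions made here for demonstration.

```python
from collections import Counter

def pos_ngrams(tags, n=2):
    """Return the multiset of POS n-grams for one sentence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def similarity(query_tags, cand_tags, n=2, length_weight=0.5):
    """Blend n-gram overlap with length closeness (assumed scoring, both in [0, 1])."""
    q, c = pos_ngrams(query_tags, n), pos_ngrams(cand_tags, n)
    overlap = sum((q & c).values())   # multiset intersection
    union = sum((q | c).values())     # multiset union
    ngram_sim = overlap / union if union else 0.0
    longest = max(len(query_tags), len(cand_tags))
    length_sim = min(len(query_tags), len(cand_tags)) / longest if longest else 1.0
    return (1 - length_weight) * ngram_sim + length_weight * length_sim

def retrieve(query_tags, treebank, k=2):
    """Return the k treebank entries most similar to the query sentence."""
    ranked = sorted(treebank, key=lambda e: similarity(query_tags, e["tags"]), reverse=True)
    return ranked[:k]

# Hypothetical miniature treebank: each entry holds only its POS tag sequence.
treebank = [
    {"id": "s1", "tags": ["NOUN", "VERB", "NOUN"]},
    {"id": "s2", "tags": ["ADJ", "NOUN", "VERB", "ADV", "PUNCT", "NOUN"]},
    {"id": "s3", "tags": ["NOUN", "VERB", "ADJ"]},
]
query = ["NOUN", "VERB", "NOUN"]
top = retrieve(query, treebank, k=2)
print([e["id"] for e in top])
```

In a pipeline like the one the abstract describes, the retrieved entries would then be placed in the LLM prompt alongside the UDPipe baseline parse; that prompting step is not modelled here.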

57. 【2604.05557】EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

Link: https://arxiv.org/abs/2604.05557

Authors: Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li, Bofei Liu, Zhengyang Liu, Guancheng Li, Qingfu Zhu, Wanxiang Che

Categories: Computation and Language (cs.CL)

Keywords: require proactively searching, align experimental settings, support reproducible conclusions, Scientific research, searching the literature

Comments: none

Abstract:Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the accumulated evidence in the memory to answer objective questions that require cross paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows, providing an evaluation platform for verifiable and reproducible research agents.

58. 【2604.05552】Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue

Link: https://arxiv.org/abs/2604.05552

Authors: Junan Hu, Shudan Guo, Wenqi Liu, Jianhua Yin, Yinwei Wei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, face fundamental challenges, Language Models demonstrate, face fundamental

Comments: 14 pages, 7 figures, ACL 2026

Abstract: Large Language Models demonstrate outstanding performance in many language tasks but still face fundamental challenges in managing the non-linear flow of human conversation. The prevalent approach of treating dialogue history as a flat, linear sequence is misaligned with the intrinsically hierarchical and branching structure of natural discourse, leading to inefficient context utilization and a loss of coherence during extended interactions involving topic shifts or instruction refinements. To address this limitation, we introduce Context-Agent, a novel framework that models multi-turn dialogue history as a dynamic tree structure. This approach mirrors the inherent non-linearity of conversation, enabling the model to maintain and navigate multiple dialogue branches corresponding to different topics. Furthermore, to facilitate robust evaluation, we introduce the Non-linear Task Multi-turn Dialogue (NTM) benchmark, specifically designed to assess model performance in long-horizon, non-linear scenarios. Our experiments demonstrate that Context-Agent enhances task completion rates and improves token efficiency across various LLMs, underscoring the value of structured context management for complex, dynamic dialogues. The dataset and code are available on GitHub.
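
To make the contrast between a flat history and a tree-structured one concrete, here is a minimal sketch. It is not the Context-Agent implementation: the abstract does not say how branches are created or selected, so the `Turn` class, `reply`, and `context_for` are hypothetical names, and the only idea illustrated is that a branch sees the turns on its own root-to-leaf path rather than the whole conversation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Turn:
    """One dialogue turn; children are alternative continuations (branches)."""
    text: str
    parent: Optional["Turn"] = None
    children: List["Turn"] = field(default_factory=list)

    def reply(self, text: str) -> "Turn":
        """Attach a new turn under this one and return it."""
        child = Turn(text, parent=self)
        self.children.append(child)
        return child

def context_for(turn: Optional[Turn]) -> List[str]:
    """Collect only the turns on the path from the root to `turn`."""
    path = []
    while turn is not None:
        path.append(turn.text)
        turn = turn.parent
    return path[::-1]

# Hypothetical conversation with a topic shift that opens a second branch.
root = Turn("Plan a trip to Rome")
hotels = root.reply("Find hotels near the centre")
hotels_cheap = hotels.reply("Only under 100 EUR")
food = root.reply("Switch topic: restaurant tips")

print(context_for(hotels_cheap))
print(context_for(food))
```

The restaurant branch carries two turns of context instead of four, which is the kind of token saving a structured history could offer under these assumptions.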

59. 【2604.05551】FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation--Full Version

Link: https://arxiv.org/abs/2604.05551

Authors: Dat Nguyen-Cong, Tung Kieu, Hoang Thanh-Tung

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: correct previous errors, correct previous, continuous diffusion language, diffusion language models, diffusion language

Comments: camera-ready version, accepted by ACL Findings (ACL 2026)

Abstract: Self-conditioning has been central to the success of continuous diffusion language models, as it allows models to correct previous errors. Yet its ability degrades precisely in the regime where diffusion is most attractive for deployment: few-step sampling for fast inference. In this study, we show that when models only have a few denoising steps, inaccurate self-conditioning induces a substantial approximation gap; this mistake compounds across denoising steps and ultimately dominates the sample quality. To address this, we propose a novel training framework that handles these errors during learning by perturbing the self-conditioning signal to match inference noise, improving robustness to prior estimation errors. In addition, we introduce a token-level noise-awareness mechanism that prevents training from saturation, hence improving optimization. Extensive experiments across conditional generation benchmarks demonstrate that our framework surpasses standard continuous diffusion models while providing up to 400x faster inference speed, and remains competitive against other one-step diffusion frameworks.

60. 【2604.05550】AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Link: https://arxiv.org/abs/2604.05550

Authors: Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)

Keywords: Artificial intelligence research, SOTA models, Artificial intelligence, intelligence research increasingly, research increasingly depends

Comments: none

Abstract:Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

61. 【2604.05549】Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Link: https://arxiv.org/abs/2604.05549

Authors: Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing, Datao You

Categories: Computation and Language (cs.CL)

Keywords: security threats, widespread application, application of LLM-based, complexity has introduced, introduced new security

Comments: none

Abstract:With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent's performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent's reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments.

62. 【2604.05546】Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Link: https://arxiv.org/abs/2604.05546

Authors: Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li

Categories: Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, enable sophisticated reasoning, Large Vision-Language, Vision-Language Models, visual token dominance

Comments: Accepted to ACL 2026 Findings

Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the "visual memory wall" in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.

63. 【2604.05540】Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

Link: https://arxiv.org/abs/2604.05540

Authors: Jinhu Fu, Yan Bai, Longzhu He, Yihang Lou, Yanxiao Zhao, Li Sun, Sen Su

Categories: Computation and Language (cs.CL)

Keywords: effectively handle outdated, Large language models, handle outdated information, Large language, handle outdated

Comments: Accepted by ACL 2026 main conference

Abstract:Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real-world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high-quality instruction data. The model is then trained to reason over edited knowledge through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval-Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real-time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models. The codes are available at this https URL.

64. 【2604.05536】Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

Link: https://arxiv.org/abs/2604.05536

Authors: Zhongxin Yang, Chun Bao, Yuanwei Bin, Xiang I.A. Yang, Shiyi Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: robust statistical regularities, statistical regularities, Natural language, exhibits robust statistical, Natural

Comments: none

Abstract:Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token sequence using an embedding-step signal. Across multiple languages and corpora, the resulting power spectrum exhibits a robust power law with an exponent close to $5/3$ over an extended frequency range. This scaling is observed consistently in contextual embeddings from both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomization of token order. These results show that the observed scaling reflects multiscale, context-dependent organization rather than lexical statistics alone. By analogy with the Kolmogorov spectrum in turbulence, our findings suggest that semantic information is integrated in a scale-free, self-similar manner across linguistic scales, and provide a quantitative, model-agnostic benchmark for studying complex structure in language representations.

65. 【2604.05522】Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Link: https://arxiv.org/abs/2604.05522

Authors: Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang

Categories: Computation and Language (cs.CL)

Keywords: Omni Large Language, Large Language Models, Omni Large, Large Language, holistic multi-modal perception

Comments: none

Abstract:Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

66. 【2604.05483】Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2604.05483

Authors: Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, shown a high, high capability

Comments: none

Abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates multiple reinforcement learning agents to efficiently identify topics (some nodes in KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.

67. 【2604.05477】Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

Link: https://arxiv.org/abs/2604.05477

Authors: Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang

Categories: Computation and Language (cs.CL)

Keywords: Autonomous GUI agents, previous operations succeeded, deterministic environment responses, Verification-driven GUI Agent, assume deterministic environment

Comments: ACL 2026 Main Conference

Abstract:Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency, rendering delays, and system interruptions, this assumption leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Moreover, learning robust recovery strategies is challenging due to the high cost of online interaction and the lack of real-time feedback in offline this http URL propose VeriGUI (Verification-driven GUI Agent), which explicitly models action outcomes and recovery under noisy environments. VeriGUI introduces a Thinking--Verification--Action--Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline that combines Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. We further construct a Robustness Benchmark based on AndroidControl to evaluate failure recognition and correction. Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

68. 【2604.05467】CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.05467

Authors: Siddharth Jain, Venkat Narayan Vedam

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: single-shot answer generation, consumes evidence mid-inference, language models shift, evaluating the role, language models

Comments: 6 figures, 14 tables; appendix includes bootstrap CIs, metric definitions, duplicate position sensitivity, prompt template, and reproducibility details

Abstract:As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
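
The three perturbation operators named in this abstract (REMOVE, REPLACE, DUPLICATE) are simple list edits, and a rough sketch may help show what "per-evidence-item utility" means in practice. Everything below is an assumption for illustration: the function names, the `toy_answer` stand-in for a RAG model, and the choice to measure only the correctness axis. Grounding faithfulness, confidence error, and trace divergence from the paper are not modelled.

```python
def remove(evidence, i):
    """REMOVE: drop the i-th evidence item."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence, i, distractor):
    """REPLACE: swap the i-th evidence item for a distractor passage."""
    return evidence[:i] + [distractor] + evidence[i + 1:]

def duplicate(evidence, i):
    """DUPLICATE: repeat the i-th evidence item immediately after itself."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]

def correctness_delta(answer_fn, question, evidence, perturbed, gold):
    """Change in correctness (1 = correct, 0 = wrong) caused by a perturbation."""
    base = int(answer_fn(question, evidence) == gold)
    after = int(answer_fn(question, perturbed) == gold)
    return after - base

def toy_answer(question, evidence):
    """Hypothetical stand-in for a RAG model: answers only if the fact is present."""
    return "Paris" if any("Paris" in e for e in evidence) else "unknown"

question = "What is the capital of France?"
evidence = ["Paris is the capital of France.", "France is in Europe."]
distractor = "Berlin is the capital of Germany."

deltas = {
    "REMOVE": correctness_delta(toy_answer, question, evidence, remove(evidence, 0), "Paris"),
    "REPLACE": correctness_delta(toy_answer, question, evidence, replace(evidence, 0, distractor), "Paris"),
    "DUPLICATE": correctness_delta(toy_answer, question, evidence, duplicate(evidence, 0), "Paris"),
}
print(deltas)
```

With this toy model, removing or replacing the supporting item flips the answer while duplicating it changes nothing, which mirrors the qualitative pattern the abstract reports for correctness.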

69. 【2604.05461】Content Fuzzing for Escaping Information Cocoons on Digital Social Media

Link: https://arxiv.org/abs/2604.05461

Authors: Yifeng He, Ziye Tang, Hao Chen

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: social media limit, media limit users', limit users' exposure, Information cocoons, diverse viewpoints

Comments: accepted to findings of ACL 2026

Abstract:Information cocoons on social media limit users' exposure to posts with diverse viewpoints. Modern platforms use stance detection as an important signal in recommendation and ranking pipelines, which can route posts primarily to like-minded audiences and reduce cross-cutting exposure. This restricts the reach of dissenting opinions and hinders constructive discourse. We take the creator's perspective and investigate how content can be revised to reach beyond existing affinity clusters. We present ContentFuzz, a confidence-guided fuzzing framework that rewrites posts while preserving their human-interpreted intent and induces different machine-inferred stance labels. ContentFuzz aims to route posts beyond their original cocoons. Our method guides a large language model (LLM) to generate meaning-preserving rewrites using confidence feedback from stance detection models. Evaluated on four representative stance detection models across three datasets in two languages, ContentFuzz effectively changes machine-classified stance labels, while maintaining semantic integrity with respect to the original content.

70. 【2604.05445】Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

Link: https://arxiv.org/abs/2604.05445

Authors: Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen, Hongxia Xu, Renjie Hua, Chuan Ren, Jian Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: reward modeling faces, Vision-language reward modeling, black boxes, faces a dilemma, generative approaches

Comments: ACL 2026 Main

Abstract:Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.

71. 【2604.05438】Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction

Link: https://arxiv.org/abs/2604.05438

Authors: Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: GPU memory, offloaded beyond GPU, decode-time key-value, generation is increasingly, increasingly limited

Comments: none

Abstract:Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.
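
The bias from renormalizing over a Top-K subset, and the fix of adding contributions in the unnormalized domain before a single normalization, can be shown with a small numeric sketch. This is not the paper's module: the remainder term below is computed exactly so the identity is visible, whereas the abstract describes estimating it from a fixed-size feature-map summary built at prefill time. Sink/tail anchors are also left out, and all names and numbers are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partial_sums(q, keys, values, idx):
    """Unnormalized softmax numerator and denominator over positions `idx`."""
    num = den = 0.0
    for i in idx:
        w = math.exp(dot(q, keys[i]))
        num += w * values[i]
        den += w
    return num, den

def completed_attention(exact, remainder):
    """Add both parts in the unnormalized domain, then normalize once."""
    (n_e, d_e), (n_r, d_r) = exact, remainder
    return (n_e + n_r) / (d_e + d_r)

# Toy single-head example with scalar values.
q = [1.0, 0.5]
keys = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1], [0.5, 0.5], [0.0, 0.3]]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

scores = [dot(q, k) for k in keys]
top_k = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:2]
rest = [i for i in range(len(keys)) if i not in top_k]

full_num, full_den = partial_sums(q, keys, values, range(len(keys)))
full = full_num / full_den

exact = partial_sums(q, keys, values, top_k)
selection_only = exact[0] / exact[1]  # softmax renormalized over Top-K only
# Stand-in for the estimated mid-region terms: here they are exact, not approximated.
completed = completed_attention(exact, partial_sums(q, keys, values, rest))

print(selection_only, completed, full)
```

With an exact remainder the completed output matches full attention, while the selection-only output does not; in the setting the abstract describes, the quality of the result would depend on how well the feature-map summary approximates those remainder terms.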

72. 【2604.05429】Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset

链接https://arxiv.org/abs/2604.05429

作者:Tinko Sebastian Bartels,Ruixiang Wu,Xinyu Lu,Yikai Lu,Fanzeng Xia,Haoxiang Yang,Yue Chen,Tongxin Li

类目:Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:unstructured contextual information, open-source digital twin, digital twin explicitly, twin explicitly designed, renewable energy dynamics

备注

点击查看摘要

Abstract:Addressing the critical need for intelligent, context-aware energy management in renewable systems, we introduce the OpenCEM Simulator and Dataset: the first open-source digital twin explicitly designed to integrate rich, unstructured contextual information with quantitative renewable energy dynamics. Traditional energy management relies heavily on numerical time series, thereby neglecting the significant predictive power embedded in human-generated context (e.g., event schedules, system logs, user intentions). OpenCEM bridges this gap by offering a unique platform comprising both a meticulously aligned, language-rich dataset from a real-world PV-and-battery microgrid installation and a modular simulator capable of natively processing this multi-modal context. The OpenCEM Simulator provides a high-fidelity environment for developing and validating novel control algorithms and prediction models, particularly those leveraging Large Language Models. We detail its component-based architecture, hybrid data-driven and physics-based modelling capabilities, and demonstrate its utility through practical examples, including context-aware load forecasting and the implementation of online optimal battery charging control strategies. By making this platform publicly available, OpenCEM aims to accelerate research into the next generation of intelligent, sustainable, and truly context-aware energy systems.

73. 【2604.05424】PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection

链接https://arxiv.org/abs/2604.05424

作者:Siyuan Cheng,Bozhong Tian,YanChao Hao,Zheng Wei

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Efficient/Low-Resource Methods for NLP, Generation, Question Answering

备注: ACL 2026 Findings; published 06 Apr 2026; CC BY 4.0

点击查看摘要

Abstract:The emergence of reasoning models, exemplified by OpenAI o1, signifies a transition from intuitive to deliberative cognition, effectively reorienting the scaling laws from pre-training paradigms toward test-time computation. While Monte Carlo Tree Search (MCTS) has shown promise in this domain, existing approaches typically treat each rollout as an isolated trajectory. This lack of information sharing leads to severe inefficiency and substantial computational redundancy, as the search process fails to leverage insights from prior explorations. To address these limitations, we propose PRISM-MCTS, a novel reasoning framework that draws inspiration from human parallel thinking and reflective processes. PRISM-MCTS integrates a Process Reward Model (PRM) with a dynamic shared memory, capturing both "Heuristics" and "Fallacies". By reinforcing successful strategies and pruning error-prone branches, PRISM-MCTS effectively achieves refinement. Furthermore, we develop a data-efficient training strategy for the PRM, achieving high-fidelity evaluation under a few-shot regime. Empirical evaluations across diverse reasoning benchmarks substantiate the efficacy of PRISM-MCTS. Notably, it halves the trajectory requirements on GPQA while surpassing MCTS-RAG and Search-o1, demonstrating that it scales inference by reasoning judiciously rather than exhaustively.

74. 【2604.05417】Multi-Drafter Speculative Decoding with Alignment Feedback

链接https://arxiv.org/abs/2604.05417

作者:Taehyeon Kim,Hojung Jung,Se-Young Yun

类目:Computation and Language (cs.CL)

关键词:accelerates large language, Speculative decoding, large language model, draft future tokens, target LLM

备注: ACL 2026 Findings

点击查看摘要

Abstract:Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens. However, individual drafters, often trained for specific tasks or domains, exhibit limited effectiveness across diverse applications. To address this, we introduce MetaSD, a unified framework that integrates multiple drafters into the SD process. MetaSD dynamically allocates computational resources to heterogeneous drafters by leveraging alignment feedback and framing drafter selection as a multi-armed bandit problem. Extensive experiments show MetaSD consistently outperforms single-drafter approaches.
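
To make the "drafter selection as a multi-armed bandit" framing concrete, here is a minimal sketch using the textbook UCB1 rule. The abstract does not say which bandit algorithm or reward the paper uses, so UCB1, the class name `UCB1DrafterSelector`, and the choice of "fraction of drafted tokens accepted by the target model" as the reward are all assumptions made for illustration.

```python
import math


class UCB1DrafterSelector:
    """Pick one of several drafters per round with the UCB1 rule (illustrative)."""

    def __init__(self, num_drafters, exploration=2.0):
        self.counts = [0] * num_drafters         # times each drafter was used
        self.mean_reward = [0.0] * num_drafters  # running mean reward per drafter
        self.exploration = exploration
        self.total = 0                           # total rounds played so far

    def select(self):
        # Try every drafter once before trusting the statistics.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Mean reward plus an optimism bonus that shrinks with usage.
        scores = [
            mean + math.sqrt(self.exploration * math.log(self.total) / count)
            for mean, count in zip(self.mean_reward, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Hypothetical reward: accepted_tokens / drafted_tokens, in [0, 1].
        self.counts[arm] += 1
        self.total += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]
```

In a decoding loop, each round would call `select()`, let that drafter propose tokens, have the target LLM verify them, and pass the observed acceptance ratio to `update()`. Drafters that align poorly with the target on the current input are then used less and less often.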

75. 【2604.05397】Confidence Should Be Calibrated More Than One Turn Deep

链接https://arxiv.org/abs/2604.05397

作者:Zhaohan Zhang,Chengzhengxu Li,Xiaoming Liu,Chao Shen,Ziquan Liu,Ioannis Patras

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, multi-turn, multi-turn calibration

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.
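
As a rough illustration of what a per-turn calibration metric could look like, the sketch below computes the standard binned Expected Calibration Error restricted to the predictions made at one conversation turn. The abstract does not give the exact definition of ECE@T, so the equal-width binning, the `(turn, confidence, correct)` record layout, and the function names are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Standard ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("no predictions to evaluate")
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 goes into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def ece_at_turn(records, turn, num_bins=10):
    """ECE over the records from one turn; records are (turn, confidence, correct)."""
    rows = [(conf, ok) for t, conf, ok in records if t == turn]
    return expected_calibration_error(
        [conf for conf, _ in rows], [ok for _, ok in rows], num_bins
    )
```

Evaluating `ece_at_turn` for T = 1, 2, 3, and so on over the same set of conversations gives a curve; a rising curve after user pushback would correspond to the degradation the abstract describes.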

76. 【2604.05387】Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA

链接https://arxiv.org/abs/2604.05387

作者:Xing Tang,Hao Chen,Shiwei Li,Fuyuan Lyu,Weijie Shi,Lingjie Li,Dugang Liu,Weihong Luo,Xiku Du,Xiuqiang He

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Large language models, numerous industrial applications, Large language, financial, industrial applications

备注: Accepted to Webconf 2026 industry track

点击查看摘要

Abstract:Large language models (LLMs) have been incorporated into numerous industrial applications. Meanwhile, a vast array of API assets is scattered across various functions in the financial domain. An online financial question-answering system can leverage both LLMs and private APIs to provide timely financial analysis and information. The key is equipping the LLM model with function calling capability tailored to a financial scenario. However, a generic LLM requires customized financial APIs to call and struggles to adapt to the financial domain. Additionally, online user queries are diverse and contain out-of-distribution parameters compared with the required function input parameters, which makes it more difficult for a generic LLM to serve online users. In this paper, we propose a data-driven pipeline to enhance function calling in LLM for our online, deployed financial QA, comprising dataset construction, data augmentation, and model training. Specifically, we construct a dataset based on a previous study and update it periodically, incorporating queries and an augmentation method named AugFC. The addition of user query-related samples will exploit our financial toolset in a data-driven manner, and AugFC explores the possible parameter values to enhance the diversity of our updated dataset. Then, we train an LLM with a two-step method, which enables the use of our financial functions. Extensive experiments on existing offline datasets, as well as the deployment of an online scenario, illustrate the superiority of our pipeline. The related pipeline has been adopted in the financial QA of YuanBao (this https URL), one of the largest chat platforms in China.

77. 【2604.05378】ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

链接https://arxiv.org/abs/2604.05378

作者:Kaiser Hamid,Can Cui,Nade Liang

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:evaluations largely assume, Recent progress, execute natural-language navigation, natural-language navigation commands, largely assume instructions

备注

点击查看摘要

Abstract:Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

78. 【2604.05355】ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning

链接https://arxiv.org/abs/2604.05355

作者:Xuan Xiong,Huan Liu,Li Gu,Zhixiang Chi,Yue Qiu,Yuanhao Yu,Yang Wang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:improves large language, produces excessively long, inefficient reasoning traces, reasoning improves large, large language model

备注: ACL 2026 (Main)

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at this https URL

79. 【2604.05350】DQA: Diagnostic Question Answering for IT Support

链接https://arxiv.org/abs/2604.05350

作者:Vishaal Kapoor,Mariam Dundua,Sarthak Ahuja,Neda Kordjazi,Evren Yortucboylu,Vaibhavi Padala,Derek Ho,Jennifer Whitted,Rebecca Steinert

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:effective resolution requires, resolution requires iterative, ambiguous user reports, requires iterative evidence, iterative evidence gathering

备注: 7 pages, 2 tables, accepted at ACL 2026 Industry Track

点击查看摘要

Abstract:Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.
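
The idea of aggregating retrieved cases "at the level of root causes rather than individual documents" can be sketched as below. This is only an illustration: the dictionary fields `root_cause` and `score`, the function name, and the sum-then-normalize rule are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def aggregate_by_root_cause(retrieved_cases):
    """Sum retrieval scores per root cause and normalize them to shares.

    Each case is a dict with hypothetical keys "root_cause" and "score".
    Returns a dict ordered from the most to the least supported root cause.
    """
    totals = defaultdict(float)
    for case in retrieved_cases:
        totals[case["root_cause"]] += case["score"]
    norm = sum(totals.values())
    if norm == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {cause: score / norm for cause, score in ranked}
```

Two moderately similar historical tickets that share a root cause then outweigh one highly similar ticket with a different cause, which is the kind of evidence accumulation a per-document ranking cannot express.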

80. 【2604.05339】Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

链接https://arxiv.org/abs/2604.05339

作者:Xiangxu Zhang,Jiamin Wang,Qinlin Zhao,Hanze Guo,Linzhuo Li,Jing Yao,Xiao Zhou,Xiaoyuan Yi,Xing Xie

类目:Computation and Language (cs.CL)

关键词:drawn growing attention, growing attention, increasingly integrated, drawn growing, human society

备注

点击查看摘要

Abstract:As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based multi-agent systems, where group-level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and what changes it induces. In this work, we introduce CIVA, a controlled multi-agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community's collective dynamics, including those diverging from LLMs' original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power-seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi-agent value alignment.

81. 【2604.05318】DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

链接https://arxiv.org/abs/2604.05318

作者:Jason Lucas,Matt Murtagh,Ali Al-Lawati,Uchendu Uchendu,Adaku Uchendu,Dongwon Lee

类目:Computation and Language (cs.CL)

关键词:Standard American English, Standard American, Harmful content detectors, disinformation classifiers, American English

备注: Accepted to ACL 2026

点击查看摘要

Abstract:Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE's linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: this https URL

82. 【2604.05306】LLMs Should Express Uncertainty Explicitly

链接https://arxiv.org/abs/2604.05306

作者:Junyu Guo,Shangding Gu,Ming Jin,Costas Spanos,Javad Lavaei

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, drive decisions, Large, language models

备注

点击查看摘要

Abstract:Large language models are increasingly used in settings where uncertainty must drive decisions such as abstention, retrieval, and verification. Most existing methods treat uncertainty as a latent quantity to estimate after generation rather than a signal the model is trained to express. We instead study uncertainty as an interface for control. We compare two complementary interfaces: a global interface, where the model verbalizes a calibrated confidence score for its final answer, and a local interface, where the model emits an explicit uncertain marker during reasoning when it enters a high-risk state. These interfaces provide different but complementary benefits. Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest overall Adaptive RAG controller while using retrieval more selectively. Reasoning-time uncertainty signaling makes previously silent failures visible during generation, improves wrong-answer coverage, and provides an effective high-recall retrieval trigger. Our findings further show that the two interfaces work differently internally: verbal confidence mainly refines how existing uncertainty is decoded, whereas reasoning-time signaling induces a broader late-layer reorganization. Together, these results suggest that effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust a final answer, and local signals for deciding when intervention is needed.

83. 【2604.05302】Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification

链接https://arxiv.org/abs/2604.05302

作者:Jinhong Jeong,Junghun Park,Youngjae Yu

类目:Computation and Language (cs.CL)

关键词:providing comprehensible input, Input Hypothesis, comprehensible input, Text simplification supports, providing comprehensible

备注: Accepted to ACL 2026

点击查看摘要

Abstract:Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.

84. 【2604.05273】Beneath the Surface: Investigating LLMs' Capabilities for Communicating with Subtext

链接https://arxiv.org/abs/2604.05273

作者:Kabir Ahuja,Yuxuan Li,Andrew Kyle Lampinen

类目:Computation and Language (cs.CL)

关键词:implied meaning, Human communication, Human, literal content, literal clues

备注

点击查看摘要

Abstract:Human communication is fundamentally creative, and often makes use of subtext -- implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing and interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints -- even the best performing models generate literal clues 60% of the time in one of our environments -- Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving a 30%-50% reduction in overly literal clues; but they struggle to infer the presence of common ground when it is not explicitly stated. For allegory understanding, we find that paratextual and persona conditions significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research inspires future work towards socially grounded creative communication and reasoning.

85. 【2604.05268】Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

链接https://arxiv.org/abs/2604.05268

作者:Chan-Wei Hu,Zhengzhong Tu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Multi-modal retrieval-augmented generation, Multi-modal retrieval-augmented, retrieval-augmented generation, relies heavily, image-question queries

备注: 12 pages, 4 figures

点击查看摘要

Abstract:Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.

86. 【2604.05267】Do Domain-specific Experts exist in MoE-based LLMs?

链接https://arxiv.org/abs/2604.05267

作者:Giang Do,Hung Le,Truyen Tran

类目:Computation and Language (cs.CL)

关键词:Large Language Models, extremely large models, training extremely large, improved computational efficiency, Large Language

备注: 15 pages

点击查看摘要

Abstract:In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE-based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: Do domain-specific experts exist in MoE-based LLMs? To answer the question, we evaluate ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain-specific experts. Building on this finding, we propose Domain Steering Mixture of Experts (DSMoE), a training-free framework that introduces zero additional inference cost and outperforms both well-trained MoE-based LLMs and strong baselines, including Supervised Fine-Tuning (SFT). Experiments on four advanced open-source MoE-based LLMs across both target and non-target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at this https URL.

87. 【2604.05250】DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

链接https://arxiv.org/abs/2604.05250

作者:Satyam Goyal,Kushal Patel,Tanush Mittal,Arjun Laxman

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:enabling parallel token, bidirectional context modeling, parallel token generation, offer a promising, context modeling

备注

点击查看摘要

Abstract:Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the inability to cache key-value pairs due to bidirectional attention, requiring $O(N^2)$ computations at each generation step. While recent methods like FastDLLM and DkvCache improve inference speed through attention approximations and caching strategies, they achieve speedups at the cost of generation quality. We propose DualDiffusion, a speculative decoding framework for MDMs that combines fast drafter models (using efficient approximations) with slower, more accurate verifier models. By running multiple steps of a lightweight drafter followed by a single verification step, DualDiffusion achieves a superior Pareto frontier between generation steps and accuracy compared to existing approaches. We evaluate our method on MMLU and GSM8K, demonstrating that DualDiffusion maintains high accuracy while reducing the number of generation steps required, effectively pushing the quality-efficiency trade-off curve for masked diffusion language models.

88. 【2604.05248】Improving Sparse Memory Finetuning

链接https://arxiv.org/abs/2604.05248

作者:Satyam Goyal,Anirudh Kanchi,Garv Shah,Prakhar Gupta

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, real-world applications require, require continual adaptation, applications require continual

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, like full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental trade-off: catastrophic forgetting. They modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally "surprising" tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.
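
One way to read "slot selection based on KL divergence relative to a background distribution" is sketched below. The abstract gives no formula, so everything here is an assumption: memory-slot access counts are used as the two distributions, each slot is scored by its pointwise contribution $p_i \log(p_i / q_i)$ to $\mathrm{KL}(p \,\|\, q)$, and the names `kl_slot_scores` and `select_slots` are hypothetical.

```python
import math


def kl_slot_scores(batch_counts, background_counts, smoothing=1.0):
    """Score each slot by its pointwise contribution to KL(batch || background).

    Both arguments map slot id -> access count. Add-`smoothing` smoothing keeps
    every probability positive so the logarithm is always defined.
    """
    slots = sorted(set(batch_counts) | set(background_counts))
    batch_total = sum(batch_counts.values()) + smoothing * len(slots)
    background_total = sum(background_counts.values()) + smoothing * len(slots)
    scores = {}
    for slot in slots:
        p = (batch_counts.get(slot, 0) + smoothing) / batch_total
        q = (background_counts.get(slot, 0) + smoothing) / background_total
        scores[slot] = p * math.log(p / q)
    return scores


def select_slots(batch_counts, background_counts, k):
    """Return the k slots whose usage in this batch is most surprising."""
    scores = kl_slot_scores(batch_counts, background_counts)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Only the selected slots would receive gradient updates. Slots that are accessed often in the batch but rarely in the background get large positive scores, while slots that are common everywhere score near or below zero and stay frozen.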

89. 【2604.05243】Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning

链接https://arxiv.org/abs/2604.05243

作者:Jon-Paul Cacioli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:blocks are square, balls are round, round and blocks, Background, simply learn

备注: 27 pages, 7 figures, 22 references. Pre-registered study (OSF: https://osf.io/qj9hb/). Code and data: https://github.com/synthiumjp/overhypothesis. Submitted to Cognitive Computation

点击查看摘要

Abstract:Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories -- a second-order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M-25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre-registered runs evaluated on a 1,040-item wug test battery, every model achieved perfect first-order exemplar retrieval (100%) while second-order generalisation to novel nouns remained at chance (50-52%), a result confirmed by equivalence testing. A feature-swap diagnostic revealed that models rely on frame-to-feature template matching rather than structured noun-to-domain-to-feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental-scale training conditions.

90. 【2604.05242】XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

链接https://arxiv.org/abs/2604.05242

作者:Jiahao Xu,Rui Hu,Olivera Kotevska,Zikai Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词:Large Language Model, Language Model, enabling reliable attribution, Multi-bit watermarking, Large Language

备注: Accepted by ACL 2026 as a main conference paper

点击查看摘要

Abstract:Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade-off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose XMark, a novel method for encoding and decoding binary messages in LLM-generated texts. The unique design of XMark's encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that XMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at this https URL.

91. 【2604.05226】RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

Link: https://arxiv.org/abs/2604.05226

Authors: Yi Ru Wang, Carter Ung, Evan Gubarev, Christopher Tan, Siddhartha Srinivasa, Dieter Fox

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: robotic manipulation systems, number of experts, difficult to extend, systems has largely, largely relied

Comments: Yi Ru Wang and Carter Ung contributed equally

Abstract:Evaluation of robotic manipulation systems has largely relied on fixed benchmarks authored by a small number of experts, where task instances, constraints, and success criteria are predefined and difficult to extend. This paradigm limits who can shape evaluation and obscures how policies respond to user-authored variations in task intent, constraints, and notions of success. We argue that evaluating modern manipulation policies requires reframing evaluation as a language-driven process over structured physical domains. We present RoboPlayground, a framework that enables users to author executable manipulation tasks using natural language within a structured physical domain. Natural language instructions are compiled into reproducible task specifications with explicit asset definitions, initialization distributions, and success predicates. Each instruction defines a structured family of related tasks, enabling controlled semantic and behavioral variation while preserving executability and comparability. We instantiate RoboPlayground in a structured block manipulation domain and evaluate it along three axes. A user study shows that the language-driven interface is easier to use and imposes lower cognitive workload than programming-based and code-assist baselines. Evaluating learned policies on language-defined task families reveals generalization failures that are not apparent under fixed benchmark evaluations. Finally, we show that task diversity scales with contributor diversity rather than task count alone, enabling evaluation spaces to grow continuously through crowd-authored contributions. Project Page: this https URL

92. 【2604.05217】On the Geometry of Positional Encodings in Transformers

Link: https://arxiv.org/abs/2604.05217

Authors: Giansalvo Cirrincione

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: mathematical operations inside, language models process, models process sequences, Neural language models, models process

Comments: none

Abstract:Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) = n-1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance.

93. 【2604.05192】Faster Superword Tokenization

Link: https://arxiv.org/abs/2604.05192

Authors: Craig W. Schmidt, Chris Tanner, Yuval Pinter

Categories: Computation and Language (cs.CL)

Keywords: Byte Pair Encoding, Byte Pair, Pair Encoding, pre-tokenization boundaries, functionally limiting

Comments: none

Abstract:Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.
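
The frequency aggregation the abstract refers to can be illustrated with ordinary BPE pair counting. Identical pretokens are collapsed into a count table, and each adjacent symbol pair is weighted by that count, so no document has to stay in memory. The following is a minimal Python sketch, assuming whitespace pre-tokenization and character-level symbols. The function names are illustrative and are not taken from the paper's released code, and the sketch covers only regular pretokens, not supermerge learning.

```python
from collections import Counter

def aggregate_pretokens(docs):
    """Collapse a corpus into {pretoken: frequency}; documents are not retained."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())  # assumption: whitespace pre-tokenization
    return counts

def count_pairs(pretoken_counts):
    """Count adjacent symbol pairs, weighting each by its pretoken's frequency."""
    pairs = Counter()
    for pretoken, freq in pretoken_counts.items():
        symbols = list(pretoken)  # assumption: start from single characters
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

docs = ["the cat sat", "the hat", "the cat"]
pretoken_counts = aggregate_pretokens(docs)
pair_counts = count_pairs(pretoken_counts)
# The most frequent pair is the candidate for the next merge.
best_pair = max(pair_counts, key=pair_counts.get)
```

On the reported timings, 4.7 CPU days is 406,080 seconds, which is roughly 673 times the 603 seconds quoted for BoundlessBPE, consistent with the "more than 600x" figure.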

94. 【2604.05190】Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models

Link: https://arxiv.org/abs/2604.05190

Authors: Ziyi Chen, Mengxian Lyu, Cheng Peng, Yonghui Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: labor-intensive bottleneck, patients for enrollment, bottleneck that leads, leads to under-enrollment, LLMs

Comments: none

Abstract:Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the "Lost in the Middle" issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.

95. 【2604.05179】Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Link: https://arxiv.org/abs/2604.05179

Authors: Purva Chiniya, Kevin Scaria, Sagar Chaturvedi

Categories: Computation and Language (cs.CL)

Keywords: Large language models, degrade user experience, defensive filters frequently, filters frequently over-refuse, frequently over-refuse benign

Comments: Accepted at LREC2026

Abstract:Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") and a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.

96. 【2604.05163】What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

Link: https://arxiv.org/abs/2604.05163

Authors: Jonathan Ivey, Anjalie Field, Ziang Xiao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: elicit high-quality responses, human experiences, elicit high-quality, Qualitative Interview Corpus, interview

Comments: 24 pages, 14 figures

Abstract:Qualitative interviews provide essential insights into human experiences when they elicit high-quality responses. While qualitative and NLP researchers have proposed various measures of interview quality, these measures lack validation that high-scoring responses actually contribute to the study's goals. In this work, we identify, implement, and evaluate 10 proposed measures of interview response quality to determine which are actually predictive of a response's contribution to the study findings. To conduct our analysis, we introduce the Qualitative Interview Corpus, a newly constructed dataset of 343 interview transcripts with 16,940 participant responses from 14 real research projects. We find that direct relevance to a key research question is the strongest predictor of response quality. We additionally find that two measures commonly used to evaluate NLP interview systems, clarity and surprisal-based informativeness, are not predictive of response quality. Our work provides analytic insights and grounded, scalable metrics to inform the design of qualitative studies and the evaluation of automated interview systems.

97. 【2604.05159】Planning to Explore: Curiosity-Driven Planning for LLM Test Generation

Link: https://arxiv.org/abs/2604.05159

Authors: Alfonso Amayuelas, Firas Laakom, Piotr Piękos, Wenyi Wang, Yifan Xu, Yuhui Wang, Jürgen Schmidhuber, William Wang

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: testing and evaluation, naturally extended, test generation, code testing, automated test generation

Comments: none

Abstract:The use of LLMs for code generation has naturally extended to code testing and evaluation. As codebases grow in size and complexity, so does the need for automated test generation. Current approaches for LLM-based test generation rely on strategies that maximize immediate coverage gain, a greedy approach that plateaus on code where reaching deep branches requires setup steps that individually yield zero new coverage. Drawing on principles of Bayesian exploration, we treat the program's branch structure as an unknown environment, and an evolving coverage map as a proxy probabilistic posterior representing what the LLM has discovered so far. Our method, CovQValue, feeds the coverage map back to the LLM, generates diverse candidate plans in parallel, and selects the most informative plan by LLM-estimated Q-values, seeking actions that balance immediate branch discovery with future reachability. Our method outperforms greedy selection on TestGenEval Lite, achieving 51-77% higher branch coverage across three popular LLMs and winning on 77-84% of targets. In addition, we build a benchmark for iterative test generation, RepoExploreBench, where they achieve 40-74%. These results show the potential of curiosity-driven planning methods for LLM-based exploration, enabling more effective discovery of program behavior through sequential interaction.

98. 【2604.05158】Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

Link: https://arxiv.org/abs/2604.05158

Authors: Ahmed Ewais, Ahmed Hashish, Amr Ali

Categories: Computation and Language (cs.CL)

Keywords: Large language models, models encode extensive, encode extensive world, extensive world knowledge, world knowledge valuable

Comments: 16 pages, 9 figures, 12 tables

Abstract:Large language models encode extensive world knowledge valuable for zero-shot named entity recognition. However, their causal attention mechanism, where tokens attend only to preceding context, prevents effective token classification when disambiguation requires future context. Existing approaches use LLMs generatively, prompting them to list entities or produce structured outputs, but suffer from slow autoregressive decoding, hallucinated entities, and formatting errors. We propose Just Pass Twice (JPT), a simple yet effective method that enables causal LLMs to perform discriminative token classification with full bidirectional context. Our key insight is that concatenating the input to itself lets each token in the second pass attend to the complete sentence, requiring no architectural modifications. We combine these representations with definition-guided entity embeddings for flexible zero-shot generalization. Our approach achieves state-of-the-art results on zero-shot NER benchmarks, surpassing the previous best method by +7.9 F1 on average across CrossNER and MIT benchmarks, while being over 20x faster than comparable generative methods.
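
The key insight about concatenation can be checked with a small mask calculation. Under a causal mask a token sees only itself and earlier positions, so in the second copy of the input every position already has the whole first copy in view. The sketch below is a pure-Python illustration under that assumption. The helper names and the example sentence are hypothetical, and it models only attention visibility, not the classifier or the definition-guided entity embeddings.

```python
def causal_visible(position):
    """Positions a token may attend to under a causal mask: itself and earlier."""
    return set(range(position + 1))

def original_tokens_seen(position, n):
    """Distinct original-sentence indices visible from `position` in the doubled input."""
    return {j % n for j in causal_visible(position)}

tokens = ["Apple", "opened", "a", "store"]  # hypothetical example sentence
n = len(tokens)
doubled = tokens + tokens  # the input concatenated to itself

# First copy: token i sees only the first i + 1 tokens of the sentence.
first_pass = [len(original_tokens_seen(i, n)) for i in range(n)]
# Second copy: every token sees the complete sentence.
second_pass = [len(original_tokens_seen(i, n)) for i in range(n, 2 * n)]
```

Here the first token sees one of the four sentence tokens in the first copy, while its counterpart in the second copy sees all four, which is the context token classification needs when disambiguation depends on later words.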

99. 【2604.05149】EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

Link: https://arxiv.org/abs/2604.05149

Authors: Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, Chuxu Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language model, exhibit complementary strengths, Large language, language model agents, complementary strengths

Comments: none

Abstract:Large language model agents often exhibit complementary strengths, making routing a promising approach for multi-agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph-based query routing with targeted instruction refinement in a closed-loop co-evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router-weighted answer agreement. Together, these designs enable more capable and more efficient multi-agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed-loop refinement and adaptive collaboration.

100. 【2604.05137】EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback

Link: https://arxiv.org/abs/2604.05137

Authors: Samira Hajizadeh, Suman Jana

Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: Large language models, functionally correct, correct but inefficient, generate code, Large language

Comments: none

Abstract:Large language models (LLMs) often generate code that is functionally correct but inefficient in runtime and memory. Prior approaches to improving code efficiency typically rely on absolute execution feedback, such as profiling a single program's runtime or memory usage, which is costly and provides weak guidance for refinement. We propose Relative Contrastive Feedback (RCF), an inference-time feedback mechanism that requires no model fine-tuning or parameter updates. RCF compares two structurally similar programs for the same task and highlights the differences associated with better efficiency. Building on this idea, we introduce EffiPair, an inference-time iterative refinement framework that operates entirely at test time by generating multiple candidate solutions, identifying informative program pairs with large efficiency gaps, summarizing their execution differences into lightweight feedback, and using this signal to produce more efficient solutions. By replacing isolated scalar feedback with pairwise contrastive comparisons, EffiPair provides more direct guidance while reducing profiling and prompting overhead. Experiments on code-efficiency benchmarks show that EffiPair consistently improves efficiency while preserving correctness. For instance, with DeepSeek-Chat V3.2, EffiPair achieves up to 1.5x speedup over generation without performance feedback, while reducing token usage by more than 90% compared to prior work.

101. 【2604.05135】SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

Link: https://arxiv.org/abs/2604.05135

Authors: Berny Kabalisa

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)

Keywords: full reasoning process, sentiment dataset designed, validated financial sentiment, designed to capture, financial sentiment dataset

Comments: Dataset available on request (bernykabalisa18@gmail.com). See GitHub for dataset snapshot and automated data collection script demo [this https URL](https://github.com/bernykabalisa18-netizen/SenseAI)

Abstract:We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real-world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US-listed equities and 13 financial data categories, enabling direct integration into modern LLM fine-tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.

102. 【2604.05125】Offline RL for Adaptive Policy Retrieval in Prior Authorization

Link: https://arxiv.org/abs/2604.05125

Authors: Ruslan Sharifullin, Maxim Gorshkov, Hannah Clay

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Prior authorization, existing retrieval-augmented systems, retrieval-augmented systems rely, requires interpretation, retrieved sections

Comments: 9 pages, 7 figures, 6 tables

Abstract:Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies with fixed numbers of retrieved sections. Such fixed retrieval can be inefficient and gather irrelevant or insufficient information. We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency. We train policies using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) in an offline RL setting on logged trajectories generated from baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data. On a corpus of 186 policy chunks spanning 10 CMS procedures, CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval, while IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies. Transition-level DPO matches CQL's 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a "selective-accurate" region on the Pareto frontier that dominates both CQL and BC. A behavioral cloning baseline matches CQL, confirming that advantage-weighted or preference-based policy extraction is needed to learn selective retrieval. Lambda ablation over step costs $\lambda \in \{0.05, 0.1, 0.2\}$ reveals a clear accuracy-efficiency inflection: only at $\lambda = 0.2$ does CQL transition from exhaustive to selective retrieval.
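
The trade-off between decision correctness and retrieval cost can be made concrete with a toy return calculation. The sketch below assumes a terminal reward of +1 for a correct decision and -1 for an incorrect one, minus a per-step cost $\lambda$ for each retrieved chunk. Those reward magnitudes and the function name are illustrative assumptions, not the paper's exact reward. The step counts (20.0 and 10.6) and the $\lambda$ values are the ones quoted in the abstract.

```python
def episodic_return(correct, num_retrievals, step_cost,
                    reward_correct=1.0, reward_wrong=-1.0):
    """Terminal decision reward minus a linear retrieval cost.

    The +1/-1 terminal rewards are assumptions made for illustration only.
    """
    terminal = reward_correct if correct else reward_wrong
    return terminal - step_cost * num_retrievals

# Exhaustive retrieval (20 steps) versus selective retrieval (10.6 steps on
# average), both ending in a correct decision, at the smallest step cost.
exhaustive_low = episodic_return(True, 20.0, step_cost=0.05)
selective_low = episodic_return(True, 10.6, step_cost=0.05)

# At the largest ablated step cost, exhaustive retrieval is heavily penalized.
exhaustive_high = episodic_return(True, 20.0, step_cost=0.2)
```

Under these assumed rewards, exhaustive retrieval only breaks even at $\lambda = 0.05$ and turns strongly negative at $\lambda = 0.2$, which gives some intuition for why a higher step cost could push a policy toward selective retrieval.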

103. 【2604.05117】Watch Before You Answer: Learning from Visually Grounded Post-Training

Link: https://arxiv.org/abs/2604.05117

Authors: Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: comprehensively understand visual, vision-language models, video understanding, critical for vision-language, comprehensively understand

Comments: none

Abstract:It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: this http URL.

104. 【2604.05114】$π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models

Link: https://arxiv.org/abs/2604.05114

Authors: Quyet V. Do, Thinh Pham, Nguyen Nguyen, Sha Li, Pratibha Zunjare, Tu Vu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: initial structured data, large language models, curates reasoning data, improving long-context reasoning, study a pipeline

Comments: Our structured analytical reasoning data, which originates from Wikipedia tables, significantly improves long-context reasoning capability of LLMs

Abstract:We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, $\pi^2$, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution, and 3) back-translating step-by-step structured reasoning traces as solutions of QA pairs given realistic web-search context. Supervised fine-tuning with gpt-oss-20b and Qwen3-4B-Instruct-2507 on $\pi^2$ yields consistent improvements across four long-context reasoning benchmarks and our $\pi^2$-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self-distillation, where gpt-oss-20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating $\pi^2$'s usefulness. Our code, data, and models are open-source at this https URL.

105. 【2604.05096】RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World

Link: https://arxiv.org/abs/2604.05096

Authors: Hanbing Liu, Lang Cao, Yang Li

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, evolving knowledge challenging, continuous knowledge drift, continuously evolving knowledge

Comments: none

Abstract:Large language models (LLMs) acquire most of their knowledge during pretraining, which ties them to a fixed snapshot of the world and makes adaptation to continuously evolving knowledge challenging. As facts, entities, and events change over time, models may experience continuous knowledge drift, resulting not only in outdated predictions but also in temporally inconsistent reasoning. Although existing approaches, such as continual finetuning, knowledge editing, and retrieval-augmented generation (RAG), aim to update or supplement model knowledge, they are rarely evaluated in settings that reflect chronological, evolving, and real-world knowledge evolution. In this work, we introduce a new benchmark of real-world dynamic events, constructed from time-stamped evidence that captures how knowledge evolves over time, which enables systematic evaluation of model adaptation under continuous knowledge drift. The benchmark reveals that most existing methods, including vanilla RAG and several learning-based approaches, struggle under this setting, exposing critical limitations such as catastrophic forgetting and temporal inconsistency. To mitigate these limitations, we propose a time-aware retrieval baseline, Chronos, which progressively organizes retrieved evidence into an Event Evolution Graph to enable more temporally consistent understanding in LLMs without additional training. Overall, this work provides a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic settings.

106. 【2604.05091】MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Link: https://arxiv.org/abs/2604.05091

Authors: Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)

Keywords: parameter large language, large language models, large language, full precision, memory-centric system

Comments: none

Abstract:We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84$\times$ the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.

107. 【2604.05090】Multilingual Language Models Encode Script Over Linguistic Structure

Link: https://arxiv.org/abs/2604.05090

Authors: Aastha A K Verma, Anwoy Chatterjee, Mehak Gupta, Tanmoy Chakraborty

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: shared parameter space, organization remains elusive, internal organization remains, orthographically diverse languages, parameter space

Comments: Accepted at ACL 2026 (Main)

Abstract:Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. Focusing on compact, distilled models where representational trade-offs are explicit, we analyze language-associated units in Llama-3.2-1B and Gemma-2-2B using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.

108. 【2604.05087】Document Optimization for Black-Box Retrieval via Reinforcement Learning

Link: https://arxiv.org/abs/2604.05087

Authors: Omri Uzan, Ron Polonsky, Douwe Kiela, Christopher Potts

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: shifts computation offline, avoiding additional query-time, additional query-time processing, computation offline, avoiding additional

Abstract:Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to degrade performance, often introducing noise that obfuscates the discriminative signal. We recast document expansion as a document optimization problem: a language model or a vision language model is fine-tuned to transform documents into representations that better align with the expected query distribution under a target retriever, using GRPO with the retriever's ranking improvements as rewards. This approach requires only black-box access to retrieval ranks, and is applicable across single-vector, multi-vector and lexical retrievers. We evaluate our approach on code retrieval and visual document retrieval (VDR) tasks. We find that learned document transformations yield retrieval gains and in many settings enable smaller, more efficient retrievers to outperform larger ones. For example, applying document optimization to OpenAI text-embedding-3-small model improves nDCG@5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), even slightly surpassing the 6.5X more expensive OpenAI text-embedding-3-large model (66.3 on code; 57.0 on VDR). When retriever weights are accessible, document optimization is often competitive with fine-tuning, and in most settings their combination performs best, improving Jina-ColBERT-V2 from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.
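
To make the reward signal in this abstract concrete, here is a minimal sketch of how rank improvements might feed GRPO-style group-relative advantages. The abstract only says that ranking improvements serve as rewards, so the reward definition (positions gained by the relevant document) and the names `rank_improvement_reward` and `group_relative_advantages` are assumptions for illustration. The normalization step is the standard group mean and standard deviation used by GRPO.

```python
import math

def rank_improvement_reward(rank_before, rank_after):
    """Hypothetical reward: positions gained by the relevant document (lower rank is better)."""
    return float(rank_before - rank_after)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled rewrites of one document that originally ranked 12th (assumed numbers).
ranks_after = [9, 11, 13, 11]
rewards = [rank_improvement_reward(12, r) for r in ranks_after]
advantages = group_relative_advantages(rewards)
```

Rewrites that move the document up more than the group average get a positive advantage, and those that push it down get a negative one. Only the returned ranks are needed, which matches the black-box setting the abstract describes.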

109. 【2604.05083】Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

Link: https://arxiv.org/abs/2604.05083

Authors: Firoj Alam, Gagan Bhatia, Sahinur Rahman Laskar, Shammur Absar Chowdhury

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: evaluating generated text, Large Language Models, generated text, prompt design, aggregation strategies

Abstract:While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly and highly sensitive to prompt design, language, and aggregation strategies, which severely limits reproducibility. To address these challenges, we propose OmniScore, a family of complementary, deterministic learned metrics developed using small size ($$1B) parameter models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models on large-scale synthetic supervision ($\sim$564k instances, in 107 languages) and evaluated them using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in 6 languages. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at this https URL

110. 【2604.05075】MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems

Link: https://arxiv.org/abs/2604.05075

Authors: Frazier N. Baker, Trieu Nguyen, Reza Averly, Botao Yu, Daniel Adu-Ampratwum, Huan Sun, Xia Ning

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multi-objective retrosynthesis planning, requiring dynamic balancing, Multi-objective retrosynthesis, retrosynthesis planning, critical chemistry task

Comments: 36 pages, 1 figure

Abstract:Multi-objective retrosynthesis planning is a critical chemistry task requiring dynamic balancing of quality, safety, and cost objectives. Language model-based multi-agent systems (MAS) offer a promising approach for this task: leveraging interactions of specialized agents to incorporate multiple objectives into retrosynthesis planning. We present MMORF, a framework for constructing MAS for multi-objective retrosynthesis planning. MMORF features modular agentic components, which can be flexibly combined and configured into different systems, enabling principled evaluation and comparison of different system designs. Using MMORF, we construct two representative MAS: MASIL and RFAS. On a newly curated benchmark consisting of 218 multi-objective retrosynthesis planning tasks, MASIL achieves strong safety and cost metrics on soft-constraint tasks, frequently Pareto-dominating baseline routes, while RFAS achieves a 48.6% success rate on hard-constraint tasks, outperforming state-of-the-art baselines. Together, these results show the effectiveness of MMORF as a foundational framework for exploring MAS for multi-objective retrosynthesis planning. Code and data are available at this https URL.

111. 【2604.05074】Memory Dial: A Training Framework for Controllable Memorization in Language Models

Link: https://arxiv.org/abs/2604.05074

Authors: Xiangbo Zhang, Ali Emami

Categories: Computation and Language (cs.CL)

Keywords: Memory Dial, widely studied, difficult to isolate, memorization pressure, Memorization

Comments: Accepted to ACL Findings 2026

Abstract:Memorization in language models is widely studied but remains difficult to isolate and control. Understanding when and what models memorize is essential for explaining their predictions, yet existing approaches are post-hoc: they can detect memorization in trained models, but cannot disentangle its effects from architecture, data, or optimization. We introduce Memory Dial, a training framework that makes memorization pressure an explicit, controllable variable. Memory Dial interpolates between standard cross-entropy and a temperature-sharpened objective via a single parameter $\alpha$, producing a family of models identical in architecture and training setup (within each sweep), differing only in memorization pressure. Experiments across six architectures and five benchmarks demonstrate that: (1) $\alpha$ reliably controls memorization pressure, with seen-example accuracy increasing monotonically while unseen accuracy remains stable; (2) larger models are more responsive to memorization pressure; and (3) frequent sequences are easier to memorize than rare ones. Additional analyses show that the effect is robust across a range of sharpening temperatures, differs qualitatively from single-temperature cross-entropy, transfers to multilingual settings, and is detectable even on naturally occurring single-occurrence sequences. Memory Dial provides a controlled experimental framework for studying how memorization behavior emerges and interacts with generalization in language models.
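
The interpolation described in this abstract can be sketched as follows. The abstract does not give the exact loss, so the form below is an assumption: a convex combination, weighted by $\alpha$, of standard cross-entropy and cross-entropy computed on logits divided by a temperature below 1. The names `memory_dial_loss`, `alpha`, and `temperature` are hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]

def cross_entropy(logits, target, temperature=1.0):
    """Negative log-probability of the target class."""
    return -log_softmax(logits, temperature)[target]

def memory_dial_loss(logits, target, alpha, temperature=0.5):
    """Assumed form: (1 - alpha) * standard CE + alpha * temperature-sharpened CE."""
    standard = cross_entropy(logits, target)
    sharpened = cross_entropy(logits, target, temperature)
    return (1.0 - alpha) * standard + alpha * sharpened

example_logits = [2.0, 1.0, 0.1]
loss_no_pressure = memory_dial_loss(example_logits, 0, alpha=0.0)
loss_full_pressure = memory_dial_loss(example_logits, 0, alpha=1.0)
```

With `alpha=0.0` this reduces to ordinary cross-entropy, and sweeping `alpha` toward 1 moves the objective toward the sharpened term while everything else in training stays fixed, which is the kind of single-variable sweep the abstract describes.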

112. 【2604.05051】This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

Link: https://arxiv.org/abs/2604.05051

Authors: Hye Sun Yun, Geetika Kapoor, Michael Mackert, Ramez Kouzy, Wei Xu, Junyi Jessy Li, Byron C. Wallace

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, increasingly turning, turning to large, complex and difficult, difficult to articulate

Comments: 31 pages, 4 tables, 19 figures

Abstract:Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.

113. 【2604.05030】Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

Link: https://arxiv.org/abs/2604.05030

Authors: Gowrav Vishwakarma, Christopher J. Agostino

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: present Phase-Associative Memory, Phase-Associative Memory, recurrent sequence model, outer products, PAM reaches validation

Comments: submitting to APS Open Science, 10 pages, 1 figure, code and training logs available at [this https URL](https://github.com/gowrav-vishwakarma/qllm2)

Abstract:We present Phase-Associative Memory (PAM), a recurrent sequence model in which all representations are complex-valued, associations accumulate in a matrix state $S_{t}$ $\in$ $\mathbb{C}^{d \times d}$ via outer products, and retrieval operates through the conjugate inner product $K_t^* \cdot Q_t / \sqrt{d}$. At $\sim$100M parameters on WikiText-103, PAM reaches validation perplexity 30.0, within $\sim$10\% of a matched transformer (27.1) trained under identical conditions, despite $4\times$ arithmetic overhead from complex computation and no custom kernels. We trace the experimental path from vector-state models, where holographic binding fails due to the $O(1/\sqrt{n})$ capacity degradation of superposed associations, to the matrix state that resolves it. The competitiveness of an architecture whose native operations are complex-valued superposition and conjugate retrieval is consistent with recent empirical evidence that semantic interpretation in both humans and large language models exhibits non-classical contextuality, and we discuss what this implies for the choice of computational formalism in language modeling.
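
The matrix-state mechanism in this abstract can be illustrated with a toy sketch. It assumes the simplest possible update, $S_t = S_{t-1} + V_t K_t^{\dagger}$, and reads out $S_t Q_t / \sqrt{d}$, so that each stored value is weighted by the conjugate inner product $K^* \cdot Q / \sqrt{d}$. Any gating, decay, or normalization in the actual model is not covered, and the function names are hypothetical.

```python
import math

def outer_conj(value, key):
    """Outer product value * key^dagger as a d x d complex matrix."""
    return [[v * k.conjugate() for k in key] for v in value]

def update_state(state, key, value):
    """Accumulate one association: S_t = S_{t-1} + value * key^dagger."""
    update = outer_conj(value, key)
    return [[s + u for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(state, update)]

def retrieve(state, query):
    """Read out S * query / sqrt(d); each stored value is weighted by key^dagger * query."""
    d = len(query)
    return [sum(s * q for s, q in zip(row, query)) / math.sqrt(d) for row in state]

d = 2
state = [[0j] * d for _ in range(d)]
key = [1j, 1 + 0j]            # unit-modulus phases, so key^dagger * key = d
value = [2 + 1j, -1j]
state = update_state(state, key, value)

matched = retrieve(state, key)                # query aligned with the stored key
orthogonal = retrieve(state, [1j, -1 + 0j])   # query orthogonal to the stored key
```

A query aligned with the stored key returns the value scaled by $\sqrt{d}$, while an orthogonal query returns zero, which is the sense in which phase alignment drives retrieval in this sketch.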

114. 【2604.05005】EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content

Link: https://arxiv.org/abs/2604.05005

Authors: Shuzhen Bi, Mingzi Zhang, Zhuoxuan Li, Xiaolong Wang, keqian Li, Aimin Zhou

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, educational capabilities remains, capabilities remains concentrated, Large language, educational assistants

Abstract:Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation -- the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8\%, while Kimi-K2.5 achieves the best cost-efficiency (80.8\% at \$0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13\% at 94\% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions ($\rho \geq 0.83$) while revealing limitations on subjective visual assessment.

115. 【2604.04997】Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

Link: https://arxiv.org/abs/2604.04997

Authors: Rong Lu, Hao Liu, Song Hou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: geoscience technical documents, classifying geoscience technical, technical documents, work presents, presents a comparative

Comments: Accepted at the IMAGE'25 Workshop (PCW-11), Society of Exploration Geophysicists (SEG). Published version available at [this https URL](https://doi.org/10.1190/image2025-w11-03.1)

Abstract:This work presents a comparative analysis of embedding-based and generative models for classifying geoscience technical documents. Using a multi-disciplinary benchmark dataset, we evaluated the trade-offs between model accuracy, stability, and computational cost. We find that generative Vision-Language Models (VLMs) like Qwen2.5-VL, enhanced with Chain-of-Thought (CoT) prompting, achieve superior zero-shot accuracy (82%) compared to state-of-the-art multimodal embedding models like QQMM (63%). We also demonstrate that while supervised fine-tuning (SFT) can improve VLM performance, it is sensitive to training data imbalance.

116. 【2604.04982】CURE: Circuit-Aware Unlearning for LLM-based Recommendation

Link: https://arxiv.org/abs/2604.04982

Authors: Ziheng Chen, Jiali Cheng, Zezhong Fan, Hadi Amiri, Yunzhi Yao, Xiangguo Sun, Yang Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: enabling rich semantic, rich semantic understanding, Recent advances, large language models, advances in large

Abstract:Recent advances in large language models (LLMs) have opened new opportunities for recommender systems by enabling rich semantic understanding and reasoning about user interests and item attributes. However, as privacy regulations tighten, incorporating user data into LLM-based recommendation (LLMRec) introduces significant privacy risks, making unlearning algorithms increasingly crucial for practical deployment. Despite growing interest in LLMRec unlearning, most existing approaches formulate unlearning as a weighted combination of forgetting and retaining objectives while updating model parameters in a uniform manner. Such formulations inevitably induce gradient conflicts between the two objectives, leading to unstable optimization and resulting in either ineffective unlearning or severe degradation of model utility. Moreover, the unlearning procedure remains largely black-box, undermining its transparency and trustworthiness. To tackle these challenges, we propose CURE, a circuit-aware unlearning framework that disentangles model components into functionally distinct subsets and selectively updates them. Here, a circuit refers to a computational subgraph that is causally responsible for task-specific behaviors. Specifically, we extract the core circuits underlying item recommendation and analyze how individual modules within these circuits contribute to the forget and retain objectives. Based on this analysis, these modules are categorized into forget-specific, retain-specific, and task-shared groups, each subject to function-specific update rules to mitigate gradient conflicts during unlearning. Experiments on real-world datasets show that our approach achieves more effective unlearning than existing baselines.

117. 【2604.04949】Learning to Retrieve from Agent Trajectories

Link: https://arxiv.org/abs/2604.04949

Authors: Yuqi Zhou, Sunhao Dai, Changle Qu, Liang Pang, Jun Xu, Ji-Rong Wen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: methods relying heavily, Information retrieval, large-scale human interaction, human interaction logs, systems have traditionally

Abstract:Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.

118. 【2604.04944】Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Link: https://arxiv.org/abs/2604.04944

Authors: Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluate large language, Multiple-choice questions, large language models, evaluate large, large language

Abstract:Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

119. 【2604.04943】The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

Link: https://arxiv.org/abs/2604.04943

Authors: Julian Coda-Forno, Jane X. Wang, Arslan Chaudhry

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reversal curse describes, autoregressive language models, decoder-only masking-based training, describes a failure, failure of autoregressive

Comments: ICLR 2026 Workshop on Representational Alignment (Re-Align)

Abstract:The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on ``$A B$'' but failing on ``$B A$''). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks and then provide a minimal mechanistic study of how these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level ``fixes'' can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.

120. 【2604.04942】TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

Link: https://arxiv.org/abs/2604.04942

Authors: Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang, Yitian Zhou, Pengcheng Zheng, Chi-lok Andy Tai, Sung-Ho Bae, Zeyu Ma, Caiyan Qin, Jinyu Guo, Yang Yang, Hengtao Shen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, natural language processing, language models, language processing, remains a core

Comments: 14 pages, 4 figures

Abstract:Enhancing the reasoning capability of large language models (LLMs) remains a core challenge in natural language processing. The Chain-of-Thought (CoT) paradigm dominates practical applications for its single-round efficiency, yet its reasoning chains often exhibit logical gaps. While multi-round paradigms like Graph-of-Thoughts (GoT), Tree-of-Thoughts (ToT), and Atom of Thought (AoT) achieve strong performance and reveal effective reasoning structures, their high cost limits practical use. To address this problem, this paper proposes a topology-based method for optimizing reasoning chains. The framework embeds essential topological patterns of effective reasoning into the lightweight CoT paradigm. Using persistent homology, we map CoT, ToT, and GoT into a unified topological space to quantify their structural features. On this basis, we design a unified optimization system: a Topological Optimization Agent diagnoses deviations in CoT chains from desirable topological characteristics and simultaneously generates targeted strategies to repair these structural deficiencies. Compared with multi-round reasoning methods like ToT and GoT, experiments on multiple datasets show that our approach offers a superior balance between reasoning accuracy and efficiency, showcasing a practical solution to ``single-round generation with multi-round intelligence''.

121. 【2604.05774】GenomeQA: Benchmarking General Large Language Models for Genome Sequence Understanding

Link: https://arxiv.org/abs/2604.05774

Authors: Weicai Long, Yusen Hou, Junning Feng, Houcheng Su, Shuo Yang, Donglin Xie, Yanlin Zhang

Categories: Genomics (q-bio.GN); Computation and Language (cs.CL)

Keywords: Large Language Models, natural language interfaces, Large Language, language interfaces, natural language

Comments: 18 pages, 9 figures, conference

Abstract:Large Language Models (LLMs) are increasingly adopted as conversational assistants in genomics, where they are mainly used to reason over biological knowledge, annotations, and analysis outputs through natural language interfaces. However, existing benchmarks either focus on specialized DNA models trained for sequence prediction or evaluate biological knowledge using text-only questions, leaving the behavior of general-purpose LLMs when directly exposed to raw genome sequences underexplored. We introduce GenomeQA, a benchmark designed to provide a controlled evaluation setting for general-purpose LLMs on sequence-based genome inference tasks. GenomeQA comprises 5,200 samples drawn from multiple biological databases, with sequence lengths ranging from 6 to 1,000 base pairs (bp), spanning six task families: Enhancer and Promoter Identification, Splice Site Identification, Taxonomic Classification, Histone Mark Prediction, Transcription Factor Binding Site Prediction, and TF Motif Prediction. Across six frontier LLMs, we find that models consistently outperform random baselines and can exploit local sequence signals such as GC content and short motifs, while performance degrades on tasks that require more indirect or multi-step inference over sequence patterns. GenomeQA establishes a diagnostic benchmark for studying and improving the use of general-purpose LLMs on raw genomic sequences.

Information Retrieval

1. 【2604.06163】Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

Link: https://arxiv.org/abs/2604.06163

Authors: Wei Huang, Keping Bi, Yinqiong Cai, Wei Chen, Jiafeng Guo, Xueqi Cheng

Categories: Information Retrieval (cs.IR)

Keywords: Recent studies show, favoring passages generated, Recent studies, favoring passages, semantically similar

Abstract:Recent studies show that neural retrievers often display source bias, favoring passages generated by LLMs over human-written ones, even when both are semantically similar. This bias has been considered an inherent flaw of retrievers, raising concerns about the fairness and reliability of modern information access systems. Our work challenges this view by showing that source bias stems from supervision in retrieval datasets rather than the models themselves. We found that non-semantic differences, like fluency and term specificity, exist between positive and negative documents, mirroring differences between LLM and human texts. In the embedding space, the bias direction from negatives to positives aligns with the direction from human-written to LLM-generated texts. We theoretically show that retrievers inevitably absorb the artifact imbalances in the training data during contrastive learning, which leads to their preferences over LLM texts. To mitigate the effect, we propose two approaches: 1) reducing artifact differences in training data and 2) adjusting LLM text vectors by removing their projection on the bias vector. Both methods substantially reduce source bias. We hope our study alleviates some concerns regarding LLM-generated texts in information access systems.
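
The second mitigation in this abstract, removing an embedding's projection on the bias vector, can be sketched in a few lines. How the paper estimates the bias vector is not stated here, so the difference-of-means estimate below (mean LLM-text embedding minus mean human-text embedding) is an assumption, as are the function names and the toy two-dimensional vectors.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def bias_direction(llm_vectors, human_vectors):
    """Assumed estimate: direction from human-written to LLM-generated embeddings."""
    llm_mean = mean_vector(llm_vectors)
    human_mean = mean_vector(human_vectors)
    return [l - h for l, h in zip(llm_mean, human_mean)]

def remove_projection(vector, bias):
    """Subtract the component of `vector` that lies along `bias`."""
    scale = dot(vector, bias) / dot(bias, bias)
    return [x - scale * b for x, b in zip(vector, bias)]

# Toy embeddings where LLM texts are shifted along the first axis.
llm_embeddings = [[2.0, 1.0], [2.0, 3.0]]
human_embeddings = [[1.0, 1.0], [1.0, 3.0]]
bias = bias_direction(llm_embeddings, human_embeddings)
debiased = remove_projection([3.0, 4.0], bias)
```

After the adjustment the vector has no component along the bias direction, so a dot-product retriever can no longer reward it for that direction alone.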

2. 【2604.06098】JUÁ - A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Link: https://arxiv.org/abs/2604.06098

Authors: Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Portuguese remains difficult, datasets differ widely, Brazilian legal, JUÁ, query style

Abstract:Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JUÁ-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JUÁ provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.

3. 【2604.06097】Masking or Mitigating? Deconstructing the Impact of Query Rewriting on Retriever Biases in RAG

Link: https://arxiv.org/abs/2604.06097

Authors: Agam Goyal, Koyel Mukherjee, Apoorv Saxena, Anirudh Phukan, Eshwar Chandrasekharan, Hari Sundaram

Categories: Information Retrieval (cs.IR)

Keywords: compromise retrieval quality, including brevity, literal matching, exhibit systematic biases, RAG

Comments: ACL'26: 13 pages, 4 figures, 4 tables

Abstract:Dense retrievers in retrieval-augmented generation (RAG) systems exhibit systematic biases -- including brevity, position, literal matching, and repetition biases -- that can compromise retrieval quality. Query rewriting techniques are now standard in RAG pipelines, yet their impact on these biases remains unexplored. We present the first systematic study of how query enhancement techniques affect dense retrieval biases, evaluating five methods across six retrievers. Our findings reveal that simple LLM-based rewriting achieves the strongest aggregate bias reduction (54\%), yet fails under adversarial conditions where multiple biases combine. Mechanistic analysis uncovers two distinct mechanisms: simple rewriting reduces bias through increased score variance, while pseudo-document generation methods achieve reduction through genuine decorrelation from bias-inducing features. However, no technique uniformly addresses all biases, and effects vary substantially across retrievers. Our results provide practical guidance for selecting query enhancement strategies based on specific bias vulnerabilities. More broadly, we establish a taxonomy distinguishing query-document interaction biases from document encoding biases, clarifying the limits of query-side interventions for debiasing RAG systems.

4. 【2604.06028】A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Link: https://arxiv.org/abs/2604.06028

Authors: Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein, Minu A. Aghevli, Kamonica L. Craig, Elizabeth M. Oliva, Joseph Erdos, Jodie Trafton, Ioana Danciu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large language models, unstructured health records, extracting clinically meaningful, Large language, clinically meaningful information

Abstract:Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments showed substantial agreement with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.

5. 【2604.05866】Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching

Link: https://arxiv.org/abs/2604.05866

Authors: Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: accurately recommending suitable, recommending suitable reviewers, conference submission volumes, submission volumes continue, continue to grow

Comments: Accepted by IJCNN-2026

Abstract:As conference submission volumes continue to grow, accurately recommending suitable reviewers has become a challenge. Most existing methods follow a "Paper-to-Paper" matching paradigm, implicitly representing a reviewer by their publication history. However, effective reviewer matching requires capturing multi-dimensional expertise, and textual similarity to past papers alone is often insufficient. To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching. P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications. Building on these profiles, P2R adopts a coarse-to-fine pipeline to balance efficiency and depth. It first performs hybrid retrieval that combines semantic and aspect-level signals to form a high-recall candidate pool, and then applies an LLM-based committee to evaluate candidates under strict rubrics, integrating both multi-dimensional expert views and a holistic Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines. Ablation studies further verify the necessity of each component. Overall, P2R highlights the value of explicit, structured expertise modeling and offers practical guidance for applying LLMs to reviewer matching.

6. 【2604.05821】CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training

Link: https://arxiv.org/abs/2604.05821

Authors: Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Existing multilingual embedding, imbalanced linguistic resources, multilingual embedding models, Existing multilingual, embedding models

Comments: ACL 2026 Main

Abstract:Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and less consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at this https URL.

7. 【2604.05818】WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Link: https://arxiv.org/abs/2604.05818

Authors: Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, Question Answering, Visual Question, highly effective paradigm

Comments: Accepted by ACL 2026 Findings

Abstract:Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on this https URL.

8. 【2604.05766】The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination

Link: https://arxiv.org/abs/2604.05766

Authors: Moritz Staudinger, Wojciech Kusa, Allan Hanbury

Categories: Information Retrieval (cs.IR)

Keywords: long enabled controlled, enabled controlled comparison, progress in Information, Information Retrieval, long enabled

Comments: Accepted at SIGIR 2026

Abstract:Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an "LLM effect": recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.

9. 【2604.05764】Generative Retrieval Overcomes Limitations of Dense Retrieval but Struggles with Identifier Ambiguity

Link: https://arxiv.org/abs/2604.05764

Authors: Adrian Bracher, Svitlana Vakulenko

Categories: Information Retrieval (cs.IR)

Keywords: shared low-dimensional space, gained widespread popularity, exhibit important theoretical, traditional sparse retrieval, sparse retrieval models

Comments: Work in progress

Abstract:While dense retrieval models, which embed queries and documents into a shared low-dimensional space, have gained widespread popularity, they were shown to exhibit important theoretical limitations and considerably lag behind traditional sparse retrieval models in certain settings. Generative retrieval has emerged as an alternative approach to dense retrieval by using a language model to predict query-document relevance directly. In this paper, we demonstrate strengths and weaknesses of generative retrieval approaches using a simple synthetic dataset, called LIMIT, that was previously introduced to empirically demonstrate the theoretical limitations of embedding-based retrieval but was not used to evaluate generative retrieval. We close this research gap and show that generative retrieval achieves the best performance on this dataset without any additional training required (0.92 and 0.99 R@2 for SEAL and MINDER, respectively), compared to dense approaches (0.03 Recall@2) and BM25 (0.86 R@2). However, we then proceed to extend the original LIMIT dataset by adding simple hard negative samples and observe the performance degrading for all the models including the generative retrieval models (0.51 R@2) as well as BM25 (0.21 R@2). Error analysis identifies a failure in the decoding mechanism, caused by the inability to produce identifiers that are unique to relevant documents. Future generative retrieval must address these issues, either by designing identifiers that are more suitable to the decoding process or by adapting decoding and scoring algorithms to preserve relevance signals.
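The abstract reports its results as Recall@2 (R@2). As a reading aid, here is a minimal sketch of how Recall@k is commonly computed for a single query. The function name `recall_at_k` and the toy document ids are illustrative assumptions, not taken from the paper or the LIMIT dataset.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results.

    ranked_ids: document ids ordered from most to least relevant (system output).
    relevant_ids: ids judged relevant for the query (ground truth).
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Toy example (hypothetical ids): one of the two relevant documents is in the top 2.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranking, relevant, k=2))  # 0.5
```

A benchmark-level score is usually the mean of this value over all queries.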

10. 【2604.05732】Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

Link: https://arxiv.org/abs/2604.05732

Authors: He Zhao, Zhiwei Zeng, Yongwei Wang, Chunyan Miao

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: graph, optimal graph structures, downstream tasks, graph structures, Graph Structure

Abstract:Real-world heterogeneous graphs are inherently noisy and usually not in the optimal graph structures for downstream tasks, which often adversely affects the performance of GRL models in downstream tasks. Although Graph Structure Learning (GSL) methods have been proposed to learn graph structures and downstream tasks simultaneously, existing methods are predominantly designed for homogeneous graphs, while GSL for heterogeneous graphs remains largely unexplored. Two challenges arise in this context. Firstly, the quality of the input graph structure has a more profound impact on GNN-based heterogeneous GRL models compared to their homogeneous counterparts. Secondly, most existing homogeneous GRL models encounter memory consumption issues when applied directly to heterogeneous graphs. In this paper, we propose a novel Graph Topology learning Enhanced Heterogeneous Graph Representation Learning framework (ToGRL). ToGRL learns high-quality graph structures and representations for downstream tasks by incorporating task-relevant latent topology information. Specifically, a novel GSL module is first proposed to extract downstream task-related topology information from a raw graph structure and project it into topology embeddings. These embeddings are utilized to construct a new graph with smooth graph signals. This two-stage approach to GSL separates the optimization of the adjacency matrix from node representation learning to reduce memory consumption. Following this, a representation learning module takes the new graph as input to learn embeddings for downstream tasks. ToGRL also leverages prompt tuning to better utilize the knowledge embedded in learned representations, thus enhancing adaptability to downstream tasks. Extensive experiments on five real-world datasets show that our ToGRL outperforms state-of-the-art methods by a large margin.

11. 【2604.05711】SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

Link: https://arxiv.org/abs/2604.05711

Authors: Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: applications rely heavily, connect disparate information, Web applications rely, disparate information resources, applications rely

Comments: Accepted at the 19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon, Republic of Korea

Abstract:Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously, semantic drift, where a valid HTTP 200 connection exists, but the target content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect semantic inconsistencies, thereby compromising web integrity and user experience. While Large Language Models (LLMs) offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink's source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.

12. 【2604.05684】Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment

Link: https://arxiv.org/abs/2604.05684

Authors: Seongtae Hong, Youngjoon Jang, Jungseob Lee, Hyeonseok Moon, Heuiseok Lim

Categories: Information Retrieval (cs.IR)

Keywords: important research area, Cross-Lingual Information Retrieval, Cross-Lingual Information, Information Retrieval, research area

Comments: ICLR 2026

Abstract:With the increasing accessibility and utilization of multilingual documents, Cross-Lingual Information Retrieval (CLIR) has emerged as an important research area. Conventionally, CLIR tasks have been conducted under settings where the language of documents differs from that of queries, and typically, the documents are composed in a single coherent language. In this paper, we highlight that in such a setting, the cross-lingual alignment capability may not be evaluated adequately. Specifically, we observe that, in a document pool where English documents coexist with another language, most multilingual retrievers tend to prioritize unrelated English documents over the related document written in the same language as the query. To rigorously analyze and quantify this phenomenon, we introduce various scenarios and metrics designed to evaluate the cross-lingual alignment performance of multilingual retrieval models. Furthermore, to improve cross-lingual performance under these challenging conditions, we propose a novel training strategy aimed at enhancing cross-lingual alignment. Using only a small dataset consisting of 2.8k samples, our method significantly improves the cross-lingual retrieval performance while simultaneously mitigating the English inclination problem. Extensive analyses demonstrate that the proposed method substantially enhances the cross-lingual alignment capabilities of most multilingual embedding models.

13. 【2604.05467】CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.05467

Authors: Siddharth Jain, Venkat Narayan Vedam

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: single-shot answer generation, consumes evidence mid-inference, language models shift, evaluating the role, language models

Comments: 6 figures, 14 tables; appendix includes bootstrap CIs, metric definitions, duplicate position sensitivity, prompt template, and reproducibility details

Abstract:As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
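The intervention idea in this abstract (perturb one evidence item, then compare an outcome metric before and after) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy scoring function, and the example passages are hypothetical and do not reproduce CUE-R's actual operators, metrics, or traces.

```python
def remove_item(evidence, index):
    """REMOVE: drop the evidence item at `index`."""
    return evidence[:index] + evidence[index + 1:]


def replace_item(evidence, index, distractor):
    """REPLACE: swap the evidence item at `index` for a distractor passage."""
    return evidence[:index] + [distractor] + evidence[index + 1:]


def duplicate_item(evidence, index):
    """DUPLICATE: repeat the evidence item at `index` right after itself."""
    return evidence[:index + 1] + [evidence[index]] + evidence[index + 1:]


def utility_delta(score_fn, original, perturbed):
    """Change in a utility score caused by the intervention (positive = harm)."""
    return score_fn(original) - score_fn(perturbed)


# Hypothetical stand-in for "run the RAG system and score the answer":
# the toy answer is correct only if both supporting passages are present.
SUPPORTS = {"A was born in X.", "X is a city in Y."}


def toy_correctness(evidence):
    return 1.0 if SUPPORTS <= set(evidence) else 0.0


evidence = ["A was born in X.", "X is a city in Y.", "Unrelated passage."]
for name, perturbed in [
    ("REMOVE support", remove_item(evidence, 0)),
    ("REPLACE support", replace_item(evidence, 0, "Distractor passage.")),
    ("DUPLICATE support", duplicate_item(evidence, 0)),
    ("REMOVE unrelated", remove_item(evidence, 2)),
]:
    print(name, utility_delta(toy_correctness, evidence, perturbed))
```

In a real setting `score_fn` would wrap a model call and an answer-quality metric. The toy scorer here only makes the per-item effect visible.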

14. 【2604.05387】Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA

Link: https://arxiv.org/abs/2604.05387

Authors: Xing Tang, Hao Chen, Shiwei Li, Fuyuan Lyu, Weijie Shi, Lingjie Li, Dugang Liu, Weihong Luo, Xiku Du, Xiuqiang He

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Large language models, numerous industrial applications, Large language, financial, industrial applications

Comments: Accepted to Webconf 2026 industry track

Abstract:Large language models (LLMs) have been incorporated into numerous industrial applications. Meanwhile, a vast array of API assets is scattered across various functions in the financial domain. An online financial question-answering system can leverage both LLMs and private APIs to provide timely financial analysis and information. The key is equipping the LLM model with function calling capability tailored to a financial scenario. However, a generic LLM requires customized financial APIs to call and struggles to adapt to the financial domain. Additionally, online user queries are diverse and contain out-of-distribution parameters compared with the required function input parameters, which makes it more difficult for a generic LLM to serve online users. In this paper, we propose a data-driven pipeline to enhance function calling in LLM for our online, deployed financial QA, comprising dataset construction, data augmentation, and model training. Specifically, we construct a dataset based on a previous study and update it periodically, incorporating queries and an augmentation method named AugFC. The addition of user query-related samples will exploit our financial toolset in a data-driven manner, and AugFC explores the possible parameter values to enhance the diversity of our updated dataset. Then, we train an LLM with a two-step method, which enables the use of our financial functions. Extensive experiments on existing offline datasets, as well as the deployment of an online scenario, illustrate the superiority of our pipeline. The related pipeline has been adopted in the financial QA of YuanBao (this https URL), one of the largest chat platforms in China.

15. 【2604.05379】Retrieve-then-Adapt: Retrieval-Augmented Test-Time Adaptation for Sequential Recommendation

Link: https://arxiv.org/abs/2604.05379

Authors: Xing Tang, Jingyang Bin, Ziqiang Cui, Xiaokun Zhang, Fuyuan Lyu, Jingyan Jiang, Dugang Liu, Chen Ma, Xiuqiang He

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: historical interaction sequences, users' historical interaction, sequential recommendation, task aims, interaction sequences

Abstract:The sequential recommendation (SR) task aims to predict the next item based on users' historical interaction sequences. Typically trained on historical data, SR models often struggle to adapt to real-time preference shifts during inference due to challenges posed by distributional divergence and parameterized constraints. Existing approaches to address this issue include test-time training, test-time augmentation, and retrieval-augmented fine-tuning. However, these methods either introduce significant computational overhead, rely on random augmentation strategies, or require a carefully designed two-stage training paradigm. In this paper, we argue that the key to effective test-time adaptation lies in achieving both effective augmentation and efficient adaptation. To this end, we propose Retrieve-then-Adapt (ReAd), a novel framework that dynamically adapts a deployed SR model to the test distribution through retrieved user preference signals. Specifically, given a trained SR model, ReAd first retrieves collaboratively similar items for a test user from a constructed collaborative memory database. A lightweight retrieval learning module then integrates these items into an informative augmentation embedding that captures both collaborative signals and prediction-refinement cues. Finally, the initial SR prediction is refined via a fusion mechanism that incorporates this embedding. Extensive experiments across five benchmark datasets demonstrate that ReAd consistently outperforms existing SR methods.

16. 【2604.05365】From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation

Link: https://arxiv.org/abs/2604.05365

Authors: Ziang Lu, Lei Sang, Lin Mu, Yiwen Zhang

Categories: Information Retrieval (cs.IR)

Keywords: exploits multi-domain correlations, alleviate data sparsity, Cross-domain Recommendation, exploits multi-domain, multi-domain correlations

Comments: 11 pages, 6 figures

Abstract:Cross-domain Recommendation (CDR) exploits multi-domain correlations to alleviate data sparsity. As a core task within this field, inter-domain recommendation focuses on predicting preferences for users who interact in a source domain but lack behavioral records in a target domain. Existing approaches predominantly rely on overlapping users as anchors for knowledge transfer. In real-world scenarios, overlapping users are often scarce, leaving the vast majority of users with only single-domain interactions. For these users, the absence of explicit alignment signals makes fine-grained preference transfer intrinsically difficult. To address this challenge, this paper proposes Language-Guided Conditional Diffusion for CDR (LGCD), a novel framework that integrates Large Language Models (LLMs) and diffusion models for inter-domain sequential recommendation. Specifically, we leverage LLM reasoning to bridge the domain gap by inferring potential target preferences for single-domain users and mapping them to real items, thereby constructing pseudo-overlapping data. We distinguish between real and pseudo-interaction pathways and introduce additional supervision constraints to mitigate the semantic noise brought by pseudo-interaction. Furthermore, we design a conditional diffusion architecture to precisely guide the generation of target user representations based on source-domain patterns. Extensive experiments demonstrate that LGCD significantly outperforms state-of-the-art methods in inter-domain recommendation tasks.

17. 【2604.05341】Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation

Link: https://arxiv.org/abs/2604.05341

Authors: Xiangchen Pan, Wei Wei

Categories: Information Retrieval (cs.IR)

Keywords: Explainable recommendation systems, explicitly uncover, uncover the rationale, transparency and credibility, recommendation systems

Comments: Accepted at DASFAA 2026. This is the author version

Abstract:Explainable recommendation systems (RSs) are designed to explicitly uncover the rationale of each recommendation, thereby enhancing the transparency and credibility of RSs. Previous methods often jointly predicted ratings and generated explanations, but overlooked the incoherence of such two objectives. To address this issue, we propose Curr-RLCER, a reinforcement learning framework for explanation coherent recommendation with dynamic rating alignment. It employs curriculum learning, transitioning from basic predictions (i.e., click through rating-CTR, selection-based rating) to open-ended recommendation explanation generation. In particular, the rewards of each stage are designed for progressively enhancing the stability of RSs. Furthermore, a coherence-driven reward mechanism is also proposed to enforce the coherence between generated explanations and predicted ratings, supported by a specifically designed evaluation scheme. The extensive experimental results on three explainable recommendation datasets indicate that the proposed framework is effective. Codes and datasets are available at this https URL.

18. 【2604.05329】Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

Link: https://arxiv.org/abs/2604.05329

Authors: Tianyu Zhan, Kairui Fu, Chengfei Lv, Zheqi Lv, Shengyu Zhang

Categories: Information Retrieval (cs.IR)

Keywords: intrinsic item relationships, Generative Recommendation, capture intrinsic item, Semantic Dilution Effect, recently transitioned

Abstract:Generative Recommendation (GR) has recently transitioned from atomic item-indexing to Semantic ID (SID)-based frameworks to capture intrinsic item relationships and enhance generalization. However, the adoption of high-granularity SIDs leads to two critical challenges: prohibitive training overhead due to sequence expansion and unstable performance reliability characterized by non-monotonic accuracy fluctuations. We identify that these disparate issues are fundamentally rooted in the Semantic Dilution Effect, where redundant tokens waste massive computation and dilute the already sparse learning signals in recommendation. To counteract this, we propose STAMP (Semantic Trimming and Auxiliary Multi-step Prediction), a framework utilizing a dual-end optimization strategy. We argue that effective SID learning requires simultaneously addressing low input information density and sparse output supervision. On the input side, Semantic Adaptive Pruning (SAP) dynamically filters redundancy during the forward pass, converting noise-laden sequences into compact, information-rich representations. On the output side, Multi-step Auxiliary Prediction (MAP) employs a multi-token objective to densify feedback, strengthening long-range dependency capture and ensuring robust learning signals despite compressed inputs. Unifying input purification and signal amplification, STAMP enhances both training efficiency and representation capability. Experiments on public Amazon and large-scale industrial datasets show STAMP achieves 1.23-1.38× speedup and 17.2%-54.7% VRAM reduction while maintaining or improving performance across multiple architectures.

19. 【2604.05314】Next-Scale Generative Reranking: A Tree-based Generative Rerank Method at Meituan

Link: https://arxiv.org/abs/2604.05314

Authors: Shuli Wang, Changhao Li, Ke Fan, Senjie Kou, Junwei Yin, Chi Wang, Yinhua Zhu, Haitao Wang, Xingxing Wang

Categories: Information Retrieval (cs.IR)

Keywords: modeling contextual information, multi-stage recommendation systems, modern multi-stage recommendation, reranking plays, contextual information

Abstract:In modern multi-stage recommendation systems, reranking plays a critical role by modeling contextual information. Due to inherent challenges such as the combinatorial space complexity, an increasing number of methods adopt the generative paradigm: the generator produces the optimal list during inference, while an evaluator guides the generator's optimization during the training phase. However, these methods still face two problems. Firstly, these generators fail to produce optimal generation results due to the lack of both local and global perspectives, regardless of whether the generation strategy is autoregressive or non-autoregressive. Secondly, the goal inconsistency problem between the generator and the evaluator during training complicates the guidance signal and leads to suboptimal performance. To address these issues, we propose Next-Scale Generation Reranking (NSGR), a tree-based generative framework. Specifically, we introduce a next-scale generator (NSG) that progressively expands a recommendation list from user interests in a coarse-to-fine manner, balancing global and local perspectives. Furthermore, we design a multi-scale neighbor loss, which leverages a tree-based multi-scale evaluator (MSE) to provide scale-specific guidance to the NSG at each scale. Extensive experiments on public and industrial datasets validate the effectiveness of NSGR. NSGR has also been successfully deployed on the Meituan food delivery platform.

20. 【2604.05309】Pay Attention to Sequence Split: Uncovering the Impacts of Sub-Sequence Splitting on Sequential Recommendation Models

Link: https://arxiv.org/abs/2604.05309

Authors: Yizhou Dang, Yifan Wu, Minhan Huang, Chuang Zhao, Lianbo Ma, Guibing Guo, Xingwei Wang, Zhu Sun

Categories: Information Retrieval (cs.IR)

Keywords: raw user interaction, user interaction sequence, mitigate data sparsity, sequential recommendation, SSS

Comments: Accepted by SIGIR 2026

Abstract:Sub-sequence splitting (SSS) has been demonstrated as an effective approach to mitigate data sparsity in sequential recommendation (SR) by splitting a raw user interaction sequence into multiple sub-sequences. Previous studies have demonstrated its ability to enhance the performance of SR models significantly. However, in this work, we discover that (i) SSS may interfere with the evaluation of the model's actual performance. We observed that many recent state-of-the-art SR models employ SSS during the data reading stage (not mentioned in the papers). When we removed this operation, performance significantly declined, even falling below that of earlier classical SR models. The varying improvements achieved by SSS and different splitting methods across different models prompt us to analyze further when SSS proves effective. We find that (ii) SSS demonstrates strong capabilities only when specific splitting methods, target strategies, and loss functions are used together. Inappropriate combinations may even harm performance. Furthermore, we analyze why sub-sequence splitting yields such remarkable performance gains and find that (iii) it evens out the distribution of training data while increasing the likelihood that different items are targeted. Finally, we provide suggestions for overcoming SSS interference, along with a discussion on data augmentation methods and future directions. We hope this work will prompt the broader community to re-examine the impact of data splitting on SR and promote fairer, more rigorous model evaluation. All analysis code and data will be made available upon acceptance. We provide a simple, anonymous implementation at this https URL.
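To make the splitting operation concrete, the sketch below shows prefix splitting, one commonly used variant in which every prefix of a user sequence becomes its own training sample. The abstract does not say which splitting methods the paper studies, so the choice of this variant, the function name `prefix_split`, and the toy item ids are all assumptions.

```python
from collections import Counter


def prefix_split(sequence, min_len=2):
    """Turn one interaction sequence into (input_prefix, target_item) samples.

    Each prefix of length >= min_len yields one sample whose target is the
    last item of that prefix.
    """
    return [
        (sequence[:end - 1], sequence[end - 1])
        for end in range(min_len, len(sequence) + 1)
    ]


user_sequence = ["i1", "i2", "i3", "i4"]

# Without splitting: a single sample that only ever targets the last item.
no_split = [(user_sequence[:-1], user_sequence[-1])]

# With prefix splitting: earlier items also appear as training targets.
with_split = prefix_split(user_sequence)

print(no_split)
print(with_split)
print(Counter(target for _, target in with_split))
```

In the split version, items `i2` and `i3` also serve as targets. This is one simple way to see how splitting can change which items are targeted during training.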

21. 【2604.05253】Spike Hijacking in Late-Interaction Retrieval

Link: https://arxiv.org/abs/2604.05253

Authors: Karthik Suresh, Tushar Vatsa, Tracy King, Asim Kadav, Michael Friedrich

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: aggregate token-level similarities, hard maximum similarity, maximum similarity, token-level similarities, retrieval models rely

Comments: Accepted at the 1st Late Interaction Retrieval Workshop (LIR 2026) at ECIR 2026. Published in CEUR Workshop Proceedings

Abstract:Late-interaction retrieval models rely on hard maximum similarity (MaxSim) to aggregate token-level similarities. Although effective, this winner-take-all pooling rule may structurally bias training dynamics. We provide a mechanistic study of gradient routing and robustness in MaxSim-based retrieval. In a controlled synthetic environment with in-batch contrastive training, we demonstrate that MaxSim induces significantly higher patch-level gradient concentration than smoother alternatives such as Top-k pooling and softmax aggregation. While sparse routing can improve early discrimination, it also increases sensitivity to document length: as the number of document patches grows, MaxSim degrades more sharply than mild smoothing variants. We corroborate these findings on a real-world multi-vector retrieval benchmark, where controlled document-length sweeps reveal similar brittleness under hard max pooling. Together, our results isolate pooling-induced gradient concentration as a structural property of late-interaction retrieval and highlight a sparsity-robustness tradeoff. These findings motivate principled alternatives to hard max pooling in multi-vector retrieval systems.
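
The three pooling rules named in the abstract can be compared with a small sketch. This is an illustration only, not the authors' code: the function name `late_interaction_score`, the toy vectors, and the `k` and `temperature` values are assumptions made for the example.

```python
import math


def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs, pooling="max", k=2, temperature=0.1):
    """Sum, over query tokens, of a pooled similarity over document tokens.

    pooling="max"     -> hard MaxSim (winner-take-all)
    pooling="topk"    -> mean of the k largest similarities
    pooling="softmax" -> softmax-weighted average of the similarities
    """
    total = 0.0
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        if pooling == "max":
            total += max(sims)
        elif pooling == "topk":
            top = sorted(sims, reverse=True)[:k]
            total += sum(top) / len(top)
        elif pooling == "softmax":
            m = max(sims)  # subtract the max for numerical stability
            weights = [math.exp((s - m) / temperature) for s in sims]
            total += sum(w * s for w, s in zip(weights, sims)) / sum(weights)
        else:
            raise ValueError(f"unknown pooling: {pooling}")
    return total


# Toy data (hypothetical): two query tokens, three document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

max_score = late_interaction_score(query, doc, pooling="max")
topk_score = late_interaction_score(query, doc, pooling="topk", k=2)
soft_score = late_interaction_score(query, doc, pooling="softmax", temperature=0.1)
```

Under hard max pooling only one document token per query token contributes to the score, so only that token receives gradient. The smoother variants spread the contribution over several tokens, which is the gradient-concentration contrast the abstract describes.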

22. 【2604.05204】Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking

Link: https://arxiv.org/abs/2604.05204

Authors: Shubham Chatterjee

Categories: Information Retrieval (cs.IR)

Keywords: Entity-oriented retrieval assumes, exhibit query-relevant entities, documents exhibit query-relevant, report conflicting results, Entity-oriented retrieval

Comments:

Click to view abstract

Abstract:Entity-oriented retrieval assumes that relevant documents exhibit query-relevant entities, yet evaluations report conflicting results. We show this inconsistency stems not from model failure, but from evaluation. On TREC Robust04, we evaluate six neural rerankers and 437 unsupervised configurations against BM25. Across 443 systems, none improves MAP by more than 0.05 under open-world evaluation over the full candidate set, despite strong gains under entity-restricted settings. The best configuration matches the official Robust04 best system and outperforms most neural rerankers, indicating that the architecture is not the limiting factor. Instead, the bottleneck is the entity channel: even under idealized selection, entity signals cover only 19.7% of relevant documents, and no method achieves both high coverage and discrimination. We explain this via a distinction between Conceptual Entity Relevance (CER) -- semantic relatedness -- and Observable Entity Relevance (OER) -- corpus-grounded discriminativeness under a given linker. All supervision strategies operate at the CER level and ignore the linking environment, leading to signals that are semantically valid but not discriminative. Improving supervision therefore does not recover open-world performance: stronger signals reduce coverage without improving effectiveness. Conditional and open-world evaluation answer different questions: exploiting entity evidence versus improving retrieval under realistic linking, but are often conflated. Progress requires datasets with entity-level discriminativeness and evaluation that reports both coverage and effectiveness. Until then, conditional gains do not imply open-world effectiveness, and open-world failures do not invalidate entity-based models.

23. 【2604.05190】Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models

Link: https://arxiv.org/abs/2604.05190

Authors: Ziyi Chen,Mengxian Lyu,Cheng Peng,Yonghui Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: labor-intensive bottleneck, patients for enrollment, bottleneck that leads, leads to under-enrollment, LLMs

Comments:

Click to view abstract

Abstract:Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the "Lost in the Middle" issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.

24. 【2604.05125】Offline RL for Adaptive Policy Retrieval in Prior Authorization

Link: https://arxiv.org/abs/2604.05125

Authors: Ruslan Sharifullin,Maxim Gorshkov,Hannah Clay

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Prior authorization, existing retrieval-augmented systems, retrieval-augmented systems rely, requires interpretation, retrieved sections

Comments: 9 pages, 7 figures, 6 tables

Click to view abstract

Abstract:Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies with fixed numbers of retrieved sections. Such fixed retrieval can be inefficient and gather irrelevant or insufficient information. We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency. We train policies using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) in an offline RL setting on logged trajectories generated from baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data. On a corpus of 186 policy chunks spanning 10 CMS procedures, CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval, while IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies. Transition-level DPO matches CQL's 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a "selective-accurate" region on the Pareto frontier that dominates both CQL and BC. A behavioral cloning baseline matches CQL, confirming that advantage-weighted or preference-based policy extraction is needed to learn selective retrieval. Lambda ablation over step costs $\lambda \in \{0.05, 0.1, 0.2\}$ reveals a clear accuracy-efficiency inflection: only at $\lambda = 0.2$ does CQL transition from exhaustive to selective retrieval.
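
The accuracy versus retrieval-cost trade-off in the reward can be made concrete with a minimal sketch. The abstract does not give the exact reward, so the form below (a terminal reward of +1 or -1 minus a per-step cost λ) is an assumption for illustration, and `episodic_return` is a hypothetical helper name.

```python
def episodic_return(correct, steps, step_cost, r_correct=1.0, r_wrong=-1.0):
    """Return of one episode: terminal decision reward minus total retrieval cost.

    Assumed form (not taken from the paper): R = r_terminal - step_cost * steps.
    """
    terminal = r_correct if correct else r_wrong
    return terminal - step_cost * steps


# A correct decision after exhaustive retrieval (20 steps) at lambda = 0.1.
exhaustive = episodic_return(correct=True, steps=20, step_cost=0.1)

# A correct decision after selective retrieval (5 steps) at the same lambda.
selective = episodic_return(correct=True, steps=5, step_cost=0.1)
```

Under this assumed reward, an exhaustive policy can be accurate and still earn a negative return, while a selective policy with the same accuracy earns a positive one. A larger λ widens that gap, which is consistent with the abstract's observation that the step cost controls when a policy shifts from exhaustive to selective retrieval.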

25. 【2604.05113】CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation

Link: https://arxiv.org/abs/2604.05113

Authors: Zezhong Fan,Ziheng Chen,Luyi Ma,Jin Huang,Lalitesh Morishetti,Kaushiki Nag,Sushant Kumar,Kannan Achan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: generative manner, popularity bias, Generative recommendation, paradigm that represents, Generative

Comments: Generative Recommendation

Click to view abstract

Abstract:Generative recommendation (GeneRec) has introduced a new paradigm that represents items as discrete semantic tokens and predicts items in a generative manner. Despite its strong performance across multiple recommendation tasks, existing GeneRec approaches still suffer from severe popularity bias and may even exacerbate it. In this work, we conduct a comprehensive empirical analysis to uncover the root causes of this phenomenon, yielding two core insights: 1) imbalanced tokenization inherits and can further amplify popularity bias from historical item interactions; 2) current training procedures disproportionately favor popular tokens while neglecting semantic relationships among tokens, thereby intensifying popularity bias. Building on these insights, we propose CRAB, a post-hoc debiasing strategy for GeneRec that alleviates popularity bias by mitigating frequency imbalance among semantic tokens. Specifically, given a well-trained model, we first rebalance the codebook by splitting over-popular tokens while preserving their hierarchical semantic structure. Based on the adjusted codebook, we further introduce a tree-structured regularizer to enhance semantic consistency, encouraging more informative representations for unpopular tokens during training. Experiments on real-world datasets demonstrate that CRAB significantly improves recommendation performance by effectively alleviating popularity bias.

26. 【2604.05087】Document Optimization for Black-Box Retrieval via Reinforcement Learning

Link: https://arxiv.org/abs/2604.05087

Authors: Omri Uzan,Ron Polonsky,Douwe Kiela,Christopher Potts

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: shifts computation offline, avoiding additional query-time, additional query-time processing, computation offline, avoiding additional

Comments:

Click to view abstract

Abstract:Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to degrade performance, often introducing noise that obfuscates the discriminative signal. We recast document expansion as a document optimization problem: a language model or a vision language model is fine-tuned to transform documents into representations that better align with the expected query distribution under a target retriever, using GRPO with the retriever's ranking improvements as rewards. This approach requires only black-box access to retrieval ranks, and is applicable across single-vector, multi-vector and lexical retrievers. We evaluate our approach on code retrieval and visual document retrieval (VDR) tasks. We find that learned document transformations yield retrieval gains and in many settings enable smaller, more efficient retrievers to outperform larger ones. For example, applying document optimization to OpenAI text-embedding-3-small model improves nDCG5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), even slightly surpassing the 6.5X more expensive OpenAI text-embedding-3-large model (66.3 on code; 57.0 on VDR). When retriever weights are accessible, document optimization is often competitive with fine-tuning, and in most settings their combination performs best, improving Jina-ColBERT-V2 from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.

27. 【2604.04997】Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

Link: https://arxiv.org/abs/2604.04997

Authors: Rong Lu,Hao Liu,Song Hou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: geoscience technical documents, classifying geoscience technical, technical documents, work presents, presents a comparative

Comments: Accepted at the IMAGE'25 Workshop (PCW-11), Society of Exploration Geophysicists (SEG). Published version available at [this https URL](https://doi.org/10.1190/image2025-w11-03.1)

Click to view abstract

Abstract:This work presents a comparative analysis of embedding-based and generative models for classifying geoscience technical documents. Using a multi-disciplinary benchmark dataset, we evaluated the trade-offs between model accuracy, stability, and computational cost. We find that generative Vision-Language Models (VLMs) like Qwen2.5-VL, enhanced with Chain-of-Thought (CoT) prompting, achieve superior zero-shot accuracy (82%) compared to state-of-the-art multimodal embedding models like QQMM (63%). We also demonstrate that while supervised fine-tuning (SFT) can improve VLM performance, it is sensitive to training data imbalance.

28. 【2604.04982】CURE: Circuit-Aware Unlearning for LLM-based Recommendation

Link: https://arxiv.org/abs/2604.04982

Authors: Ziheng Chen,Jiali Cheng,Zezhong Fan,Hadi Amiri,Yunzhi Yao,Xiangguo Sun,Yang Zhang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: enabling rich semantic, rich semantic understanding, Recent advances, large language models, advances in large

Comments:

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have opened new opportunities for recommender systems by enabling rich semantic understanding and reasoning about user interests and item attributes. However, as privacy regulations tighten, incorporating user data into LLM-based recommendation (LLMRec) introduces significant privacy risks, making unlearning algorithms increasingly crucial for practical deployment. Despite growing interest in LLMRec unlearning, most existing approaches formulate unlearning as a weighted combination of forgetting and retaining objectives while updating model parameters in a uniform manner. Such formulations inevitably induce gradient conflicts between the two objectives, leading to unstable optimization and resulting in either ineffective unlearning or severe degradation of model utility. Moreover, the unlearning procedure remains largely black-box, undermining its transparency and trustworthiness. To tackle these challenges, we propose CURE, a circuit-aware unlearning framework that disentangles model components into functionally distinct subsets and selectively updates them. Here, a circuit refers to a computational subgraph that is causally responsible for task-specific behaviors. Specifically, we extract the core circuits underlying item recommendation and analyze how individual modules within these circuits contribute to the forget and retain objectives. Based on this analysis, these modules are categorized into forget-specific, retain-specific, and task-shared groups, each subject to function-specific update rules to mitigate gradient conflicts during unlearning. Experiments on real-world datasets show that our approach achieves more effective unlearning than existing baselines.

29. 【2604.04976】Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation

Link: https://arxiv.org/abs/2604.04976

Authors: Junwei Pan,Wei Xue,Chao Zhou,Xing Zhou,Lunan Fan,Yanbo Wang,Haoran Xin,Zhiyu Hu,Yaozheng Wang,Fengye Xu,Yurong Yang,Xiaotian Li,Junbang Huo,Wentao Ning,Yuliang Sun,Chengguo Yin,Jun Zhang,Shudong Huang,Lei Xiao,Huan Yu,Irwin King,Haijie Gu,Jie Jiang

Categories: Information Retrieval (cs.IR)

Keywords: discrete token spaces, Generative recommender systems, all-modality generative recommendation, Tencent Advertising Algorithm, Advertising Algorithm Challenge

Comments:

Click to view abstract

Abstract:Generative recommender systems are rapidly emerging as a new paradigm for recommendation, where collaborative identifiers and/or multi-modal content are mapped into discrete token spaces and user behavior is modelled with autoregressive sequence models. Despite progress on multi-modal recommendation datasets, there is still a lack of public benchmarks that jointly offer large-scale, realistic and fully all-modality data designed specifically for generative recommendation (GR) in industrial advertising. To foster research in this direction, we organised the Tencent Advertising Algorithm Challenge 2025, a global competition built on top of two all-modality datasets for GR: TencentGR-1M and TencentGR-10M. Both datasets are constructed from real de-identified Tencent Ads logs and contain rich collaborative IDs and multi-modal representations extracted with state-of-the-art embedding models. The preliminary track (TencentGR-1M) provides 1 million user sequences with up to 100 interacted items each, where each interaction is labeled with exposure and click signals, while the final track (TencentGR-10M) scales this to 10 million users and explicitly distinguishes between click and conversion events at both the sequence and target level. This paper presents the task definition, data construction process, feature schema, baseline GR model, evaluation protocol, and key findings from top-ranked and award-winning solutions. Our datasets focus on multi-modal sequence generation in an advertising setting and introduce weighted evaluation for high-value conversion events. We release our datasets at this https URL and baseline implementations at this https URL to enable future research on all-modality generative recommendation at an industrial scale. The official website is this https URL.

30. 【2604.04969】MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.04969

Authors: Sijun Dai,Qiang Huang,Xiaoxing You,Jun Yu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, Retrieval-Augmented Generation

Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG$^2$-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG$^2$-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG$^2$-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead with an average 43.3$\times$ speedup and 23.9$\times$ cost reduction compared with advanced graph-based frameworks.

31. 【2604.04953】Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

Link: https://arxiv.org/abs/2604.04953

Authors: Abhishek Dharmaratnakar,Srivaths Ranganathan,Debanshu Das,Anushree Sinha

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: profound paradigm shift, Large Language Models, heuristic-based extraction methods, Multimodal Large Language, Large Language

Comments: 7 pages, 3 figures, accepted in WSDM 2026

Click to view abstract

Abstract:The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI's Sora and Google's Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.

32. 【2604.04949】Learning to Retrieve from Agent Trajectories

Link: https://arxiv.org/abs/2604.04949

Authors: Yuqi Zhou,Sunhao Dai,Changle Qu,Liang Pang,Jun Xu,Ji-Rong Wen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: methods relying heavily, Information retrieval, large-scale human interaction, human interaction logs, systems have traditionally

Comments:

Click to view abstract

Abstract:Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.

33. 【2604.04948】From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Link: https://arxiv.org/abs/2604.04948

Authors: José Guilherme Marques dos Santos,Ricardo Yang,Rui Humberto Pereira,Alexandre Sousa,Brígida Mónica Faria,Henrique Lopes Cardoso,José Duarte,José Luís Reis,Luís Paulo Reis,Pedro Pimenta,José Paulo Marques dos Santos

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: evaluated PDF processing, Retrieval-Augmented Generation, PDF processing frameworks, downstream question-answering accuracy, systems depend critically

Comments: 21 pages, 4 figures, 4 tables

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents from PDFs, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a manually curated 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, ~492K words), with LLM-as-judge scoring averaged over 10 runs. Two baselines bounded the results: naïve PDFLoader (86.9%) and manually curated Markdown (97.1%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%). Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone. Font-based hierarchy rebuilding consistently outperformed LLM-based approaches. An exploratory GraphRAG implementation scored only 82%, underperforming basic RAG, suggesting that naïve knowledge graph construction without ontological guidance does not yet justify its added complexity. These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.

34. 【2604.04947】SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs

Link: https://arxiv.org/abs/2604.04947

Authors: Nitish Kumar,Sannu Kumar,S Akash,Manish Gupta,Ankith Karat,Sriparna Saha

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: enhancing user engagement, extracting meaningful pre-game, online sports journalism, engagement and comprehension, rapid proliferation

Comments:

Click to view abstract

Abstract:With the rapid proliferation of online sports journalism, extracting meaningful pre-game and post-game insights from articles is essential for enhancing user engagement and comprehension. In this paper, we address the task of automatically extracting such insights from articles published before and after matches. We curate a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. To ensure contextual relevance, we employ a two-step validation pipeline leveraging both open-source and proprietary large language models (LLMs). We then utilize multiple state-of-the-art LLMs (GPT-4o, Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1) to generate comprehensive insights. The factual accuracy of these outputs is rigorously assessed using a FactScore-based methodology, complemented by hallucination detection via the SummaC (Summary Consistency) framework with GPT-4o. Finally, we propose SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user-specific interests. Our results demonstrate the effectiveness of this approach in generating high-quality, relevant insights, while also revealing significant differences in factual consistency and interestingness across LLMs. This work contributes a robust framework for automated, reliable insight generation from sports news content. The source code is available at this https URL.

35. 【2604.04936】Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

Link: https://arxiv.org/abs/2604.04936

Authors: Uday Allu,Sonu Kedia,Tanmay Odapally,Biddwan Ahmed

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: balance retrieval quality, systems critically depend, Retrieval-Augmented Generation, effective document chunking, document chunking strategies

Comments: 13 pages, 9 tables, 0 figures

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully agentic chunking, often suffer from high token consumption, redundant text generation, limited scalability, and poor debuggability, especially for large-scale web content ingestion. In this paper, we propose Web Retrieval-Aware Chunking (W-RAC), a novel, cost-efficient chunking framework designed specifically for web-based documents. W-RAC decouples text extraction from semantic chunk planning by representing parsed web content as structured, ID-addressable units and leveraging large language models (LLMs) only for retrieval-aware grouping decisions rather than text generation. This significantly reduces token usage, eliminates hallucination risks, and improves system this http URL analysis and architectural comparison demonstrate that W-RAC achieves comparable or better retrieval performance than traditional chunking approaches while reducing chunking-related LLM costs by an order of magnitude.
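
The idea of ID-addressable units with grouping-only decisions can be sketched as follows. This is an illustration under stated assumptions, not the W-RAC implementation: the unit IDs, the `plan` structure (standing in for what a language model might return), and the helper `assemble_chunks` are all hypothetical.

```python
def assemble_chunks(units, plan):
    """Build chunks by copying original unit text according to a grouping plan.

    units: dict mapping unit ID -> text extracted from the page
    plan:  list of groups, each a list of unit IDs (assumed model output)
    The model only chooses IDs; chunk text is always copied from `units`.
    """
    seen = set()
    chunks = []
    for group in plan:
        texts = []
        for uid in group:
            if uid not in units:
                raise KeyError(f"unknown unit id: {uid}")
            if uid in seen:
                raise ValueError(f"unit id used twice: {uid}")
            seen.add(uid)
            texts.append(units[uid])
        chunks.append("\n".join(texts))
    return chunks


# Parsed web content as ID-addressable units (toy example).
units = {
    "u1": "Pricing",
    "u2": "The basic plan costs $10.",
    "u3": "Support",
    "u4": "Email us any time.",
}

# Hypothetical grouping decision returned by a language model.
plan = [["u1", "u2"], ["u3", "u4"]]

chunks = assemble_chunks(units, plan)
```

Because the plan contains only IDs, the model's output is short, and every chunk is a verbatim copy of extracted text, so an invalid ID can be rejected before anything reaches the index.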

Computer Vision

1. 【2604.06168】Action Images: End-to-End Policy Learning via Multiview Video Generation

Link: https://arxiv.org/abs/2604.06168

Authors: Haoyu Zhen,Zixian Gao,Qiao Sun,Yilin Zhao,Yuncong Yang,Yilun Du,Tsun-Hsuan Wang,Yi-Ling Qiao,Chuang Gan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: leverage powerful video, action, future states, Action Images, leverage powerful

Comments: Project Page: [this https URL](https://actionimages.github.io/)

Click to view abstract

Abstract:World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.

2. 【2604.06165】HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Link: https://arxiv.org/abs/2604.06165

Authors: Reihaneh Zohrabi,Hosein Hasani,Akshita Gupta,Mahdieh Soleymani Baghshah,Anna Rohrbach,Marcus Rohrbach

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large vision-language models, Large vision-language, Large, vision-language models, effective detection

Comments:

Click to view abstract

Abstract:Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson's paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models' internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.

3. 【2604.06161】DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Link: https://arxiv.org/abs/2604.06161

Authors: Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu, Dmitriy Smirnov, Pablo Salamanca, Dao Mi, Pablo Delgado, Ning Yu, Julien Philip, Xin Li, Wenping Wang, Paul Debevec

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: original high dynamic, low dynamic range, high dynamic range, HDR, saturation and quantization

Notes: Project page: [this https URL](https://yzmblog.github.io/projects/DiffHDR/)

Abstract:Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

4. 【2604.06160】The Character Error Vector: Decomposable errors for page-level OCR evaluation

Link: https://arxiv.org/abs/2604.06160

Authors: Jonathan Bourne, Mwiza Simbeye, Joseph Nockels

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Optical Character Recognition, Character Error Rate, Character Recognition, Optical Character, Character Error

Notes: 6643 words, 5 figures, 15 tables

Abstract:The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making evaluating page-level OCR challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing and OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the Document Understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a Character distribution method using the Jensen-Shannon Distance. We validate the CEV's performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support Document understanding research.
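
As a rough illustration of the character-distribution variant mentioned in the abstract, the sketch below compares two texts as bags of characters using the Jensen-Shannon distance. It is not the API of the authors' Python library: the function names are hypothetical, and the real CEV additionally decomposes the error into parsing, OCR, and interaction components, which this sketch does not attempt.

```python
import math
from collections import Counter


def char_distribution(text: str, alphabet: list[str]) -> list[float]:
    """Relative frequency of each alphabet character in a non-empty text."""
    counts = Counter(text)
    total = len(text)
    return [counts[ch] / total for ch in alphabet]


def jensen_shannon_distance(p: list[float], q: list[float]) -> float:
    """Square root of the base-2 Jensen-Shannon divergence (range 0 to 1)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: list[float], y: list[float]) -> float:
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def bag_of_characters_distance(reference: str, hypothesis: str) -> float:
    """Order-insensitive distance between reference text and OCR output."""
    if not reference or not hypothesis:
        raise ValueError("both texts must be non-empty")
    alphabet = sorted(set(reference) | set(hypothesis))
    return jensen_shannon_distance(
        char_distribution(reference, alphabet),
        char_distribution(hypothesis, alphabet),
    )


if __name__ == "__main__":
    print(bag_of_characters_distance("the quick fox", "tne quick f0x"))
```

Because only character counts are compared, the score does not depend on reading order, which is why a measure of this kind stays defined when page parsing scrambles the text.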

5. 【2604.06156】MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Link: https://arxiv.org/abs/2604.06156

Authors: Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: capabilities remain underutilized, generative reasoning capabilities, reasoning capabilities remain, remain underutilized, successfully applied

Abstract:MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

6. 【2604.06129】PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Link: https://arxiv.org/abs/2604.06129

Authors: David Picard, Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Davide Allegro, Tom Ravaud, Yohann Perron, Corentin Sautier, Zeynep Sonat Baltaci, Fei Meng, Syrine Kalleli, Marta López-Rauhut, Thibaut Loiseau, Ségolène Albouy, Raphael Baena, Elliot Vincent, Loic Landrieu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Polynomial Mixer, token mixing mechanism, paper introduces, mixing mechanism, mechanism with linear

Notes: Accepted to CVPR Findings 2026

Abstract:This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at this https URL.

7. 【2604.06124】Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Link: https://arxiv.org/abs/2604.06124

Authors: Hao Chen, Fang Qiu, Fangchao Dong, Defei Yang, Eve Bohnett, Li An

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real drone-collected dataset, multimodal adaptation framework, study proposes, framework to bridge, thermal infrared imagery

Abstract:This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

8. 【2604.06113】SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Link: https://arxiv.org/abs/2604.06113

Authors: Hiba Dahmani, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: driving scenes requires, Scalable generation, remain consistent, consistent across multiple, multiple viewpoints

Abstract:Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $\Sigma$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $\Sigma$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

9. 【2604.06099】Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

Link: https://arxiv.org/abs/2604.06099

Authors: Athanasios Angelakis, Marta Gomez-Barrero

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Hierarchical Vision Transformer, Compact Hierarchical Vision, Zero-token Adaptive Compact, Adaptive Compact Hierarchical, Hierarchical Vision

Notes: Accepted at CVPR 2026 Workshop (PHAROS-AIF-MIH)

Abstract:The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
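
The "mean rank" figures quoted above are averages of per-dataset ranks. A minimal sketch of that bookkeeping follows; the model names and scores are made-up toy values rather than results from the paper, and ties are broken by sort order here, which may not match the authors' protocol.

```python
def mean_ranks(scores_by_dataset: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's rank (1 = best score) over all datasets."""
    ranks: dict[str, list[int]] = {}
    for scores in scores_by_dataset.values():
        # Higher score is better; ties fall back to the stable sort order.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


if __name__ == "__main__":
    # Toy values for illustration only (hypothetical models and datasets).
    toy_scores = {
        "dataset_1": {"model_a": 0.91, "model_b": 0.88, "model_c": 0.80},
        "dataset_2": {"model_a": 0.70, "model_b": 0.75, "model_c": 0.60},
    }
    print(mean_ranks(toy_scores))
```

A mean rank close to 1 therefore means a model is at or near the top on most datasets, even if it is not the best on every one.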

10. 【2604.06079】Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Link: https://arxiv.org/abs/2604.06079

Authors: Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, Lijun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Graphics Program Synthesis, Graphics Program, Program Synthesis, Multimodal Large Language, Synthesis is pivotal

Abstract:Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

11. 【2604.06074】Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

Link: https://arxiv.org/abs/2604.06074

Authors: Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: structurally sound controllability, advanced visual generation, Achieving fine-grained, structurally sound, sound controllability

Notes: 11 pages, 5 figures, Accepted by ICME 2026

Abstract:Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis. The code is available at this https URL.

12. 【2604.06063】EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching

Link: https://arxiv.org/abs/2604.06063

Authors: Takara Taniguchi, Ryohei Shimizu, Minh-Duc Vo, Kota Izumi, Shiqi Yang, Teppei Suzuki

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: poses significant risks, models poses significant, poses significant, significant risks, risks of copyright

Abstract:The advent of Text-to-Image generative models poses significant risks of copyright violation and deepfake generation. Since the rapid proliferation of new copyrighted works and private individuals constantly emerges, reference-based training-free content filters are essential for providing up-to-date protection without the constraints of a fixed knowledge cutoff. However, existing reference-based approaches often lack scalability when handling numerous references and require waiting for finishing image generation. To solve these problems, we propose EDGE-Shield, a scalable content filter during the denoising process that maintains practical latency while effectively blocking violative content. We leverage embedding-based matching for efficient reference comparison. Additionally, we introduce an $x$-pred transformation that converts the model's noisy intermediate latent into the pseudo-estimated clean latent at the later stage, enhancing classification accuracy of violative content at earlier denoising stages. We conduct experiments of violative content filtering against two generative models including Z-Image-Turbo and Qwen-Image. EDGE-Shield significantly outperforms traditional reference-based methods in terms of latency; it achieves an approximate $79\%$ reduction in processing time for Z-Image-Turbo and approximate $50\%$ reduction for Qwen-Image, maintaining the filtering accuracy across different model architectures.

13. 【2604.06052】Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

Link: https://arxiv.org/abs/2604.06052

Authors: Katarzyna Zaleska, Łukasz Popek, Monika Wysoczańska, Kamil Deja

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remarkable generative capabilities, operations remain opaque, exhibit remarkable generative, internal operations remain, diffusion models exhibit

Notes: CVPR 2026

Abstract:Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques focus on prompt-related interventions, we notice that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification) - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches. The code is available at this https URL.

14. 【2604.06036】CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

Link: https://arxiv.org/abs/2604.06036

Authors: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: vision-language model serving, inference limits scalability, multimodal inference limits, model serving, limits scalability

Notes: 18 pages, 34 figures

Abstract:Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.

15. 【2604.06017】Toward Aristotelian Medical Representations: Backpropagation-Free Layer-wise Analysis for Interpretable Generalized Metric Learning on MedMNIST

Link: https://arxiv.org/abs/2604.06017

Authors: Michael Karnes, Alper Yilmaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, Aristotelian Rapid Object, Platonic Representation Hypothesis, nature of backpropagation-based, Rapid Object Modeling

Abstract:While deep learning has achieved remarkable success in medical imaging, the "black-box" nature of backpropagation-based models remains a significant barrier to clinical adoption. To bridge this gap, we propose Aristotelian Rapid Object Modeling (A-ROM), a framework built upon the Platonic Representation Hypothesis (PRH). This hypothesis posits that models trained on vast, diverse datasets converge toward a universal and objective representation of reality. By leveraging the generalizable metric space of pretrained Vision Transformers (ViTs), A-ROM enables the rapid modeling of novel medical concepts without the computational burden or opacity of further gradient-based fine-tuning. We replace traditional, opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier to ensure the model's logic remains interpretable. Experiments on the MedMNIST v2 suite demonstrate that A-ROM delivers performance competitive with standard benchmarks while providing a simple and scalable, "few-shot" solution that meets the rigorous transparency demands of modern clinical environments.
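
The combination of a frozen embedding space, a labeled concept dictionary, and a kNN vote can be sketched in a few lines. This is a generic cosine-similarity kNN, not the A-ROM implementation: the dictionary layout, the function names, and the two-dimensional toy embeddings are all assumptions, and in practice the vectors would come from a pretrained ViT.

```python
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, dictionary, k=3):
    """Return the majority label among the k most similar dictionary entries.

    dictionary: list of (embedding, label) pairs (hypothetical layout).
    The neighbors are returned too, so a prediction can be inspected.
    """
    ranked = sorted(
        dictionary,
        key=lambda entry: cosine_similarity(query, entry[0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for pretrained ViT features.
    concept_dictionary = [
        ([1.0, 0.0], "normal"),
        ([0.9, 0.1], "normal"),
        ([0.0, 1.0], "lesion"),
        ([0.1, 0.9], "lesion"),
    ]
    label, neighbors = knn_predict([0.8, 0.2], concept_dictionary, k=3)
    print(label, [lab for _, lab in neighbors])
```

Returning the neighbors alongside the label is the point of the design: each decision can be traced to specific labeled examples, and no gradient step is needed to add a new concept.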

16. 【2604.06010】OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

Link: https://arxiv.org/abs/2604.06010

Authors: Yukun Wang, Ruihuang Li, Jiale Tao, Shiyuan Yang, Liyi Chen, Zhantao Yang, Handz, Yulan Guo, Shuai Shao, Qinglin Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video fundamentally intertwines, crucial axes, fundamentally intertwines, intertwines two crucial, Video fundamentally

Abstract:Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unprecedented creative control. To overcome the fundamental challenges of modality conflict and data scarcity inherent in such a system, we present two key innovations. First, we construct OmniCAM, a novel hybrid dataset combining curated real-world videos with synthetic data that provides diverse paired examples for robust multi-task learning. Second, we propose a Dual-level Curriculum Co-Training strategy that mitigates modality interference and synergistically learns from diverse data sources. This strategy operates on two levels: first, it progressively introduces control modalities by difficulties (condition-level), and second, trains for precise control on synthetic data before adapting to real data for photorealism (data-level). As a result, OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.

17. 【2604.05971】Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

Link: https://arxiv.org/abs/2604.05971

Authors: Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: lack fine-grained understanding, contrastive vision-language models, recent model variants, research has shown, shown that contrastive

Abstract:Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially those associated with off-center objects vanish from the model's embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models' attention to off-center regions.

18. 【2604.05961】HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

Link: https://arxiv.org/abs/2604.05961

Authors: Tao Hu, Varun Jampani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tremendous recent progress, generative video diffusion, video diffusion models, video, human video generation

Notes: Project page: [this https URL](https://taohuumd.github.io/projects/HumANDiff/)

Abstract:Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: this https URL

19. 【2604.05959】Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

Link: https://arxiv.org/abs/2604.05959

Authors: Ioannis Nasios

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: disaster risk reduction, support disaster risk, timely detection approaches, Synthetic Aperture Radar, landslide detection

Abstract:Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found in this https URL.
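
The NDVI index mentioned in the abstract has a standard definition, (NIR - Red) / (NIR + Red), which for Sentinel-2 is usually computed from band B8 (near infrared) and band B4 (red). The sketch below applies it to flat lists of reflectance values. The function names are illustrative, and how the authors feed the index into their models is not described here.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel.

    Returns 0.0 when both reflectances are zero to avoid division by zero
    (a convention chosen for this sketch, not taken from the paper).
    """
    denominator = nir + red
    if denominator == 0:
        return 0.0
    return (nir - red) / denominator


def ndvi_band(nir_band: list[float], red_band: list[float]) -> list[float]:
    """Pixel-wise NDVI for two equally sized reflectance bands."""
    if len(nir_band) != len(red_band):
        raise ValueError("bands must have the same number of pixels")
    return [ndvi(n, r) for n, r in zip(nir_band, red_band)]


if __name__ == "__main__":
    # Toy reflectances: vegetated, bare, and empty pixels.
    print(ndvi_band([0.75, 0.25, 0.0], [0.25, 0.75, 0.0]))
```

Dense vegetation pushes NDVI toward 1, so a sharp drop in the index over a slope is a useful cue for exposed soil after a landslide.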

20. 【2604.05947】Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

Link: https://arxiv.org/abs/2604.05947

Authors: Tianyi Liu, Yiming Li, Wenqian Wang, Jiaojiao Wang, Chen Cai, Yi Wang, Kim-Hui Yap

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predefined cross-modal interactions, Robust multimodal visual, visual analytics remains, analytics remains challenging, heterogeneous modalities provide

Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for this http URL multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal this http URL validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.

21. 【2604.05934】Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction

Link: https://arxiv.org/abs/2604.05934

Authors: Ahmet Rasim Emirdagi, Süleyman Aslan, Mısra Yavuz, Görkay Aydemir, Yunus Bilge Kurt, Nasrin Rahimi, Burak Can Biner, M. Akın Yılmaz

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: high-attenuation implants severely, implants severely degrade, standard deep learning, obscuring critical anatomical, critical anatomical structures

Comments: Accepted to CVPRW 2026 Med-Reasoner

Click to view abstract

Abstract:Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics. This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at this https URL.

22. 【2604.05933】SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration

Link: https://arxiv.org/abs/2604.05933

Authors: Yixin Zhang, Yunzhong Hou, Longqi Li, Zhenyue Qin, Yang Liu, Yue Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mitigate acoustic occlusions, reduce diagnostic ambiguity, perception typically requires, typically requires multiple, requires multiple scan

Comments:

Click to view abstract

Abstract:Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments on the ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 54.56% kidney coverage and 35.13% cyst coverage, with short trajectories consistently centered on the target cyst.

23. 【2604.05931】Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

Link: https://arxiv.org/abs/2604.05931

Authors: Jingbo Sun, Qichao Zhang, Songjun Tu, Xing Fang, Yupeng Zheng, Haoran Li, Ke Chen, Dongbin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: building generalist agents, generalist agents capable, Zero-shot unsupervised reinforcement, unsupervised reinforcement learning, visual URL

Comments:

Click to view abstract

Abstract:Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

24. 【2604.05908】Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction

Link: https://arxiv.org/abs/2604.05908

Authors: Yangyi Xiao, Siting Zhu, Baoquan Yang, Tianchen Deng, Yongbo Chen, Hesheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital twin construction, high-fidelity autonomous driving, autonomous driving simulation, Multi-traversal scene reconstruction, twin construction

Comments:

Click to view abstract

Abstract:Multi-traversal scene reconstruction is important for high-fidelity autonomous driving simulation and digital twin construction. This task involves integrating multiple sequences captured from the same geographical area at different times. In this context, a primary challenge is the significant appearance inconsistency across traversals caused by varying illumination and environmental conditions, despite the shared underlying geometry. This paper presents ADM-GS (Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction), a framework that applies an explicit appearance decomposition to the static background to alleviate appearance entanglement across traversals. For the static background, we decompose the appearance into traversal-invariant material, representing intrinsic material properties, and traversal-dependent illumination, capturing lighting variations. Specifically, we propose a neural light field that utilizes a frequency-separated hybrid encoding strategy. By incorporating surface normals and explicit reflection vectors, this design separately captures low-frequency diffuse illumination and high-frequency specular reflections. Quantitative evaluations on the Argoverse 2 and Waymo Open datasets demonstrate the effectiveness of ADM-GS. In multi-traversal experiments, our method achieves a +0.98 dB PSNR improvement over existing latent-based baselines while producing more consistent appearance across traversals. Code will be available at this https URL.

25. 【2604.05906】Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

Link: https://arxiv.org/abs/2604.05906

Authors: Jungwon Park, Jungmin Ko, Dongnam Byun, Wonjong Rhee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: interpret model behavior, boost application performance, Numerous studies, utilized cross-attention maps, generative models

Comments:

Click to view abstract

Abstract:Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.

26. 【2604.05900】AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Link: https://arxiv.org/abs/2604.05900

Authors: Dong She, Xianrong Yao, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Content Analysis, Affective Image Content, holistic Affective Image, Vision-Language Models, Content Analysis

Comments: Accepted by Findings of ACL 2026

Click to view abstract

Abstract:Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

27. 【2604.05898】Physics-Aware Video Instance Removal Benchmark

Link: https://arxiv.org/abs/2604.05898

Authors: Zirui Li, Xinghao Chen, Lingyu Jiang, Dengzhe Hou, Fangzhou Lin, Kazunori Yamada, Xiangbo Gao, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Instance Removal, requires removing target, maintaining background integrity, removing target objects, Video Instance

Comments:

Click to view abstract

Abstract:Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities, such as lingering shadows, triggered by object removal. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods, PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo, using a decoupled human evaluation protocol across three dimensions to isolate semantic, visual, and spatial failures: instruction following, rendering quality, and edit exclusivity. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.

28. 【2604.05877】Automatic dental superimposition of 3D intraorals and 2D photographs for human identification

Link: https://arxiv.org/abs/2604.05877

Authors: Antonio D. Villegas-Yeguas, Xavier Abreau-Freire, Guillermo R-García, Andrea Valsecchi, Teresa Pinho, Daniel Pérez-Mongiovi, Oscar Ibáñez, Oscar Cordón

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: primary identification method, DNA profiling, fingerprints and DNA, considered a primary, primary identification

Comments: 10 pages, 9 figures, 3 tables

Click to view abstract

Abstract:Dental comparison is considered a primary identification method, at the level of fingerprints and DNA profiling. One crucial but time-consuming step of this method is the morphological comparison. One of the main challenges in applying this method is the lack of ante-mortem medical records, especially in scenarios such as migrant deaths at the border and/or in countries where there is no universal healthcare. The availability of photos on social media where teeth are visible has led many odontologists to consider morphological comparison using them. However, state-of-the-art proposals have significant limitations, including the lack of proper modeling of perspective distortion and the absence of objective approaches that quantify morphological differences. Our proposal involves a 3D (post-mortem scan) - 2D (ante-mortem photos) approach. Using computer vision and optimization techniques, we replicate the ante-mortem image with the 3D model to perform the morphological comparison. Two automatic approaches have been developed: i) using paired landmarks and ii) using a segmentation of the teeth region to estimate camera parameters. Both are capable of obtaining very promising results over 20,164 cross comparisons from 142 samples, obtaining mean ranking values of 1.6 and 1.5, respectively. These results clearly outperform the filtering capabilities of automatic dental chart comparison approaches, while providing an automatic, objective and quantitative score of the morphological correspondence that is easy to interpret and analyze by visualizing superimposed images.

29. 【2604.05856】Neural Network Pruning via QUBO Optimization

Link: https://arxiv.org/abs/2604.05856

Authors: Osama Orabi, Artur Zagitov, Hadi Salloum, Viktor A. Lobachev, Kasymkhan Khubiev, Yaroslav Kholodov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: existing approaches rely, ignore complex interactions, Unconstrained Binary Optimization, Quadratic Unconstrained Binary, combinatorial optimization problem

Comments: 13 pages, 5 figures, 4 tables

Click to view abstract

Abstract:Neural network pruning can be formulated as a combinatorial optimization problem, yet most existing approaches rely on greedy heuristics that ignore complex interactions between filters. Formal optimization methods such as Quadratic Unconstrained Binary Optimization (QUBO) provide a principled alternative but have so far underperformed due to oversimplified objective formulations based on metrics like the L1-norm. In this work, we propose a unified Hybrid QUBO framework that bridges heuristic importance estimation with global combinatorial optimization. Our formulation integrates gradient-aware sensitivity metrics - specifically first-order Taylor and second-order Fisher information - into the linear term, while utilizing data-driven activation similarity in the quadratic term. This allows the QUBO objective to jointly capture individual filter relevance and inter-filter functional redundancy. We further introduce a dynamic capacity-driven search to strictly enforce target sparsity without distorting the optimization landscape. Finally, we employ a two-stage pipeline featuring a Tensor-Train (TT) Refinement stage - a gradient-free optimizer that fine-tunes the QUBO-derived solution directly against the true evaluation metric. Experiments on the SIDD image denoising dataset demonstrate that the proposed Hybrid QUBO significantly outperforms both greedy Taylor pruning and traditional L1-based QUBO, with TT Refinement providing further consistent gains at appropriate combinatorial scales. This highlights the potential of hybrid combinatorial formulations for robust, scalable, and interpretable neural network compression.
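
To make the QUBO view of pruning in this abstract concrete, here is a toy sketch, not the authors' method. It assumes a binary keep-mask over filters, a linear term that rewards per-filter importance, and a quadratic term that penalizes keeping pairs of similar (redundant) filters. The importance scores, similarity matrix, weight `lam`, and the brute-force search over subsets of size `k` are all hypothetical stand-ins. In particular, exhaustive enumeration replaces the paper's capacity-driven search and is only feasible for tiny examples.

```python
from itertools import combinations

def qubo_energy(keep, importance, similarity, lam):
    """Energy of a keep-set: reward importance, penalize pairwise redundancy."""
    linear = -sum(importance[i] for i in keep)
    quadratic = sum(similarity[i][j] for i, j in combinations(keep, 2))
    return linear + lam * quadratic

def best_subset(importance, similarity, k, lam=1.0):
    """Exhaustively pick the k filters with minimum energy (toy sizes only)."""
    n = len(importance)
    return min(
        combinations(range(n), k),
        key=lambda keep: qubo_energy(keep, importance, similarity, lam),
    )

# Hypothetical scores: filters 0 and 1 are both important but nearly identical.
importance = [0.9, 0.85, 0.3, 0.5]
similarity = [
    [0.0, 0.95, 0.1, 0.1],
    [0.95, 0.0, 0.1, 0.2],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
]

selected = best_subset(importance, similarity, k=2)
# A greedy baseline keeps the top-k filters by importance alone.
greedy = tuple(sorted(range(4), key=lambda i: -importance[i])[:2])
```

In this toy setup the greedy baseline keeps the two redundant filters, while the quadratic term steers the QUBO selection toward a less redundant pair, which is the kind of inter-filter interaction the abstract says greedy heuristics ignore.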

30. 【2604.05853】Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Link: https://arxiv.org/abs/2604.05853

Authors: Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paragraph-length text, render legible, enabling a fundamentally, class of misuse, fundamentally new class

Comments:

Click to view abstract

Abstract:Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.

31. 【2604.05819】Learn to Rank: Visual Attribution by Learning Importance Ranking

Link: https://arxiv.org/abs/2604.05819

Authors: David Schinagl, Christian Fruhwirth-Reisinger, Alexander Prutsch, Samuel Schulter, Horst Possegger

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Interpreting the decisions, complex computer vision, trust and accountability, safety-critical domains, decisions of complex

Comments:

Click to view abstract

Abstract:Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model's prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.

32. 【2604.05818】WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Link: https://arxiv.org/abs/2604.05818

Authors: Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, Question Answering, Visual Question, highly effective paradigm

Comments: Accepted by ACL 2026 Findings

Click to view abstract

Abstract:Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on this https URL.

33. 【2604.05794】EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

Link: https://arxiv.org/abs/2604.05794

Authors: Da Li, Dominik Engel, Deng Luo, Ivan Viola

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: virtual human modeling, Strand-level hair geometry, fundamental problem, problem in virtual, virtual human

Comments: 10 pages, 6 figures, conference

Click to view abstract

Abstract:Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.

34. 【2604.05793】BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents

Link: https://arxiv.org/abs/2604.05793

Authors: Bo Ma, Jinsong Wu, Weiqi Yan

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: single model call, raw user content, VLM agents, prompt privacy risk, privacy risk propagates

Comments:

Click to view abstract

Abstract:In LLM/VLM agents, prompt privacy risk propagates beyond a single model call because raw user content can flow into retrieval queries, memory writes, tool calls, and logs. Existing de-identification pipelines address document boundaries but not this cross-stage propagation. We propose BodhiPromptShield, a policy-aware framework that detects sensitive spans, routes them via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration to authorized boundaries. Relative to enterprise redaction, this adds explicit propagation-aware mediation and restoration timing as a security variable. Under controlled evaluation on the Controlled Prompt-Privacy Benchmark (CPPB), stage-wise propagation is suppressed from 10.7% to 7.1% across retrieval, memory, and tool stages; PER reaches 9.3% with 0.94 AC and 0.92 TSR, outperforming generic de-identification. These are controlled systems results on CPPB rather than formal privacy guarantees or public-benchmark transfer claims. The project repository is available at this https URL.

35. 【2604.05788】Sparse Gain Radio Map Reconstruction With Geometry Priors and Uncertainty-Guided Measurement Selection

Link: https://arxiv.org/abs/2604.05788

Authors: Zhihan Zeng, Ning Wei, Muhammad Baqer Mollah, Kaihe Wang, Phee Lep Yeoh, Fei Xu, Yue Xiu, Zhongpei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: environment-aware wireless communication, radio resource optimization, radio map, wireless communication, resource optimization

Comments:

Click to view abstract

Abstract:Radio maps are important for environment-aware wireless communication, network planning, and radio resource optimization. However, dense radio map construction remains challenging when only a limited number of measurements are available, especially in complex urban environments with strong blockages, irregular geometry, and restricted sensing accessibility. Existing methods have explored interpolation, low-rank cartography, deep completion, and channel knowledge map (CKM) construction, but many of these methods insufficiently exploit explicit geometric priors or overlook the value of predictive uncertainty for subsequent sensing. In this paper, we study sparse gain radio map reconstruction from a geometry-aware and active sensing perspective. We first construct UrbanRT-RM, a controllable ray-tracing benchmark with diverse urban layouts, multiple base-station deployments, and multiple sparse sampling modes. We then propose GeoUQ-GFNet, a lightweight network that jointly predicts a dense gain radio map and a spatial uncertainty map from sparse measurements and structured scene priors. The predicted uncertainty is further used to guide active measurement selection under limited sensing budgets. Extensive experiments show that our proposed GeoUQ-GFNet method achieves strong and consistent reconstruction performance across different scenes and transmitter placements generated using UrbanRT-RM. Moreover, uncertainty-guided querying provides more effective reconstruction improvement than non-adaptive sampling under the same additional measurement budget. These results demonstrate the effectiveness of combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation for sparse radio map reconstruction in complex urban environments.

36. 【2604.05781】RHVI-FDD: A Hierarchical Decoupling Framework for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2604.05781

Authors: Junhao Yang, Bo Yang, Hongwei Ge, Yanchun Liang, Heow Pueh Lee, Chunguo Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hinder downstream multimedia, downstream multimedia analysis, correcting color distortion, color distortion, Low-light images

Comments: 8 pages, 8 figures

Click to view abstract

Abstract:Low-light images often suffer from severe noise, detail loss, and color distortion, which hinder downstream multimedia analysis and retrieval tasks. The degradation in low-light images is complex: luminance and chrominance are coupled, while within the chrominance, noise and details are deeply entangled, preventing existing methods from simultaneously correcting color distortion, suppressing noise, and preserving fine details. To tackle the above challenges, we propose a novel hierarchical decoupling framework (RHVI-FDD). At the macro level, we introduce the RHVI transform, which mitigates the estimation bias caused by input noise and enables robust luminance-chrominance decoupling. At the micro level, we design a Frequency-Domain Decoupling (FDD) module with three branches for further feature separation. Using the Discrete Cosine Transform, we decompose chrominance features into low, mid, and high-frequency bands that predominantly represent global tone, local details, and noise components, which are then processed by tailored expert networks in a divide-and-conquer manner and fused via an adaptive gating module for content-aware fusion. Extensive experiments on multiple low-light datasets demonstrate that our method consistently outperforms existing state-of-the-art approaches in both objective metrics and subjective visual quality.

37. 【2604.05780】Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

Link: https://arxiv.org/abs/2604.05780

Authors: Yu Xue, Longjun Gao, Yuanqi Su, HaoAng Lu, Xiaoning Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single RGB image, single RGB, RGB image, Semantic Scene Completion, aims to reconstruct

Comments: Accepted at CVPR 2026

Click to view abstract

Abstract:Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

38. 【2604.05773】PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization

Link: https://arxiv.org/abs/2604.05773

Authors: Shicai Wei, Chunbo Luo, Qiang Zhu, Yang Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted increasing attention, increasing attention due, performance-dominant modality, Multimodal learning, attracted increasing

Comments

Click to view abstract

Abstract:Multimodal learning has attracted increasing attention due to its practicality. However, it often suffers from insufficient optimization, where the multimodal model underperforms even compared to its unimodal counterparts. Existing methods attribute this problem to the imbalanced learning between modalities and solve it by gradient modulation. This paper argues that balanced learning is not the optimal setting for multimodal learning. On the contrary, imbalanced learning driven by the performance-dominant modality that has superior unimodal performance can contribute to better multimodal performance. And the under-optimization problem is caused by insufficient learning of the performance-dominant modality. To this end, we propose the Performance-Dominant Modality Prioritization (PDMP) strategy to assist multimodal learning. Specifically, PDMP firstly mines the performance-dominant modality via the performance ranking of the independently trained unimodal model. Then PDMP introduces asymmetric coefficients to modulate the gradients of each modality, enabling the performance-dominant modality to dominate the optimization. Since PDMP only relies on the unimodal performance ranking, it is independent of the structures and fusion methods of the multimodal model and has great potential for practical scenarios. Finally, extensive experiments on various datasets validate the superiority of PDMP.
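
The two steps named in this abstract (rank the unimodal models, then scale each modality's gradients asymmetrically) might look roughly like the sketch below. The function names, the coefficient values 1.0 and 0.5, and the toy scores are all assumptions for illustration; the abstract does not give the actual coefficients or how they are applied inside a training loop.

```python
def modality_coefficients(unimodal_scores, dominant_coeff=1.0, other_coeff=0.5):
    """Give the best-performing modality a larger gradient coefficient."""
    dominant = max(unimodal_scores, key=unimodal_scores.get)
    return {
        name: (dominant_coeff if name == dominant else other_coeff)
        for name in unimodal_scores
    }


def modulate(grads, coeffs):
    """Scale each modality's gradient vector by its coefficient."""
    return {name: [coeffs[name] * g for g in grad] for name, grad in grads.items()}


# Hypothetical validation accuracies of independently trained unimodal models.
scores = {"audio": 0.62, "video": 0.48}
coeffs = modality_coefficients(scores)
scaled = modulate({"audio": [0.2, -0.4], "video": [1.0, 0.5]}, coeffs)
```

Because the coefficients depend only on the unimodal ranking, a scheme like this is independent of the fusion architecture, which matches the claim made in the abstract.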

39. 【2604.05767】Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

Link: https://arxiv.org/abs/2604.05767

Authors: Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: production ADAS systems, ADAS systems, large-scale ego-centric dashcam, production ADAS

Comments

Click to view abstract

Abstract:We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar's Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.

40. 【2604.05761】Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision

Link: https://arxiv.org/abs/2604.05761

Authors: Amadou S. Sangare, Adrien Maglo, Mohamed Chaouch, Bertrand Luvison

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently achieved remarkable, achieved remarkable progress, text alignment, recently achieved, achieved remarkable

Comments

Click to view abstract

Abstract:Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at this https URL
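
The claimed equivalence between $x_0$-supervision and a re-weighted diffusion loss can be checked in the common noise-prediction setting. The notation here is an assumption, not taken from the abstract: the noisy sample is $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and the network predicts $\hat{\epsilon}_\theta(x_t, t)$. The paper's own formulation, in particular for flow models, may differ.

```latex
\begin{align}
\hat{x}_0 &= \frac{x_t - \sigma_t\,\hat{\epsilon}_\theta(x_t, t)}{\alpha_t}, \\
x_0 - \hat{x}_0 &= \frac{\alpha_t x_0 - x_t + \sigma_t\,\hat{\epsilon}_\theta(x_t, t)}{\alpha_t}
  = \frac{\sigma_t}{\alpha_t}\,\bigl(\hat{\epsilon}_\theta(x_t, t) - \epsilon\bigr), \\
\lVert x_0 - \hat{x}_0 \rVert^2 &= \frac{\sigma_t^2}{\alpha_t^2}\,
  \lVert \epsilon - \hat{\epsilon}_\theta(x_t, t) \rVert^2 .
\end{align}
```

Under these assumptions, supervising the clean image is the usual noise-prediction loss multiplied by the timestep-dependent weight $\sigma_t^2 / \alpha_t^2$, which grows at noisier timesteps.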

41. 【2604.05748】SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge

Link: https://arxiv.org/abs/2604.05748

Authors: Dongliang Zhu, Zhiyi Niu, Bo Zhao, Jiajian Huang, Shuo Ye, Xun Lin, Hui Ma, Taorui Wang, Jiayu Zhang, Chunmei Zhu, Junzhe Cao, Yingjie Ma, Rencheng Song, Albert Clapés, Sergio Escalera, Dan Guo, Zitong Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reveal hidden patterns, Subtle visual signals, Subtle visual, naked eye, difficult to perceive

Comments: Accepted by the SVC workshop @ CVPR 2026

Click to view abstract

Abstract:Subtle visual signals, although difficult to perceive with the naked eye, contain important information that can reveal hidden patterns in visual data. These signals play a key role in many applications, including biometric security, multimedia forensics, medical diagnosis, industrial inspection, and affective computing. With the rapid development of computer vision and representation learning techniques, detecting and interpreting such subtle signals has become an emerging research direction. However, existing studies often focus on specific tasks or modalities, and models still face challenges in robustness, representation ability, and generalization when handling subtle and weak signals in real-world environments. To promote research in this area, we organize the Subtle visual Challenge, which aims to learn robust representations for subtle visual signals. The challenge includes two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. We hope that this challenge will encourage the development of more robust and generalizable models for subtle visual understanding, and further advance research in computer vision and multimodal learning. A total of 22 teams submitted their final results to this workshop competition, and the corresponding baseline models have been released on the MMDD2026 platform (this https URL).

42. 【2604.05743】On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

Link: https://arxiv.org/abs/2604.05743

Authors: Amit Vaisman, Gal Pomerants, Raz Lapid

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Modern image compression, Modern image, Reverse Channel Coding, perception trade-off, image compression methods

Comments

Click to view abstract

Abstract:Modern image compression methods are typically optimized for the rate-distortion-perception trade-off, whereas their robustness to bit-level corruption is rarely examined. We show that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs. We further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate-distortion-perception trade-off. Our findings suggest that RCC-based compression can yield more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments.
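
A robustness study of this kind needs a way to corrupt a compressed bitstream at a controlled bit error rate. The sketch below shows one simple way to do that, assuming independent flips per bit; `flip_bits`, `hamming_distance`, and the 64-byte payload are hypothetical and stand in for whatever corruption model and codec output the paper actually uses.

```python
import random


def flip_bits(data, bit_error_rate, seed=0):
    """Flip each bit of `data` independently with probability `bit_error_rate`."""
    rng = random.Random(seed)
    out = bytearray(data)
    flipped = 0
    for byte_index in range(len(out)):
        for bit in range(8):
            if rng.random() < bit_error_rate:
                out[byte_index] ^= 1 << bit
                flipped += 1
    return bytes(out), flipped


def hamming_distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Stand-in for a compressed image bitstream.
payload = bytes(range(64))
corrupted, n_flipped = flip_bits(payload, bit_error_rate=0.01)
```

Decoding `corrupted` with each codec and comparing the reconstruction against the clean decode would then give a robustness curve over bit error rates.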

43. 【2604.05742】ASSR-Net: Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion

Link: https://arxiv.org/abs/2604.05742

Authors: Qiya Song, Hongzhi Zhou, Lishan Tan, Renwei Dian, Shutao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Hyperspectral image fusion, integrating complementary information, image fusion aims, Hyperspectral image, Spectrally Recalibrated Network

Comments

Click to view abstract

Abstract:Hyperspectral image fusion aims to reconstruct high-spatial-resolution hyperspectral images (HR-HSI) by integrating complementary information from multi-source inputs. Despite recent progress, existing methods still face two critical challenges: (1) inadequate reconstruction of anisotropic spatial structures, resulting in blurred details and compromised spatial quality; and (2) spectral distortion during fusion, which hinders fine-grained spectral representation. To address these issues, we propose ASSR-Net: an Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion. ASSR-Net adopts a two-stage fusion strategy comprising anisotropic structure-aware spatial enhancement (ASSE) and hierarchical prior-guided spectral calibration (HPSC). In the first stage, a directional perception fusion module adaptively captures structural features along multiple orientations, effectively reconstructing anisotropic spatial patterns. In the second stage, a spectral recalibration module leverages the original low-resolution HSI as a spectral prior to explicitly correct spectral deviations in the fused results, thereby enhancing spectral fidelity. Extensive experiments on various benchmark datasets demonstrate that ASSR-Net consistently outperforms state-of-the-art methods, achieving superior spatial detail preservation and spectral consistency.

44. 【2604.05731】FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Link: https://arxiv.org/abs/2604.05731

Authors: Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang, Wenwu Wang, Zhifeng Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhancing immersive auditory, immersive auditory experiences, Foley art plays, audio remains labor-intensive, aligned audio remains

Comments

Click to view abstract

Abstract:Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at this https URL.

45. 【2604.05727】Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising

Link: https://arxiv.org/abs/2604.05727

Authors: Ying Liu, Junchao Zhang, Caiyun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-Light Image Enhancement, preserving fine details, fine details makes, handle complex noise, Diffusion models excel

Comments

Click to view abstract

Abstract:Diffusion models excel at image restoration via probabilistic modeling of forward noise addition and reverse denoising, and their ability to handle complex noise while preserving fine details makes them well-suited for Low-Light Image Enhancement (LLIE). Mainstream diffusion-based LLIE methods adopt either a two-stage pipeline or an auxiliary correction network to refine U-Net outputs, which severs the intrinsic link between enhancement and denoising and leads to suboptimal performance owing to inconsistent optimization objectives. To address these issues, we propose the Signal Attenuation Diffusion Model (SADM), a novel diffusion process that integrates the signal attenuation mechanism into the diffusion pipeline, enabling simultaneous brightness adjustment and noise suppression in a single stage. Specifically, the signal attenuation coefficient simulates the inherent signal attenuation of low-light degradation in the forward noise addition process, encoding the physical priors of low-light degradation to explicitly guide reverse denoising toward the concurrent optimization of brightness recovery and noise suppression, thereby eliminating the need for extra correction modules or staged training relied on by existing methods. We validate that our design maintains consistency with Denoising Diffusion Implicit Models (DDIM) via multi-scale pyramid sampling, balancing interpretability, restoration quality, and computational efficiency.

46. 【2604.05724】Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

Link: https://arxiv.org/abs/2604.05724

Authors: Yusung Ro, Jaehyun Choi, Junmo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sparse Autoencoders, existing analyses largely, analyses largely focus, CLIP vision encoders, vision encoders

Comments: CVPR 2026 Findings

Click to view abstract

Abstract:Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP's predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.

47. 【2604.05721】GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

Link: https://arxiv.org/abs/2604.05721

Authors: Weiqi Zhang, Junsheng Zhou, Haotian Geng, Kanle Shi, Shenkun Xu, Yi Fang, Yu-Shen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated superior performance, Splatting has demonstrated, proper geometric priors, Gaussian Splatting, efficiency and quality

Comments: Accepted by CVPR 2026. Project page: https://weiqi-zhang.github.io/GaussianGrow

Click to view abstract

Abstract:3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. To complete the hard-to-observe regions, we propose to iteratively detect the camera pose that observes the largest un-grown regions in the point clouds and to fill those regions by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds. Project Page: this https URL

48. 【2604.05718】MPM: Mutual Pair Merging for Efficient Vision Transformers

Link: https://arxiv.org/abs/2604.05718

Authors: Simon Ravé, Pejman Rasti, David Rousseau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Decreasing sequence length, reports proxy metrics, Decreasing sequence, prior token reduction, token reduction work

Comments: Accepted to CVPR 2026 (Findings)

Click to view abstract

Abstract:Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
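
The merging step described in this abstract (mutual nearest neighbours in cosine space, averaging, and a merge map for gather-based reconstruction) can be sketched in plain Python as below. This is a toy version under stated assumptions: tokens are small non-zero lists of floats, the function names are hypothetical, and nothing here reflects the authors' batched GPU implementation or their insertion schedule.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mutual_pair_merge(tokens):
    """Average mutual nearest-neighbour pairs; return merged tokens and a merge map."""
    n = len(tokens)
    # Nearest neighbour of every token by cosine similarity, excluding itself.
    nearest = [
        max((j for j in range(n) if j != i), key=lambda j: cosine(tokens[i], tokens[j]))
        for i in range(n)
    ]
    merged, merge_map = [], [None] * n
    for i in range(n):
        if merge_map[i] is not None:
            continue  # already merged as the partner of an earlier token
        j = nearest[i]
        if nearest[j] == i and merge_map[j] is None:
            # Mutual pair: store the average and point both tokens at the same slot.
            merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
            merge_map[i] = merge_map[j] = len(merged) - 1
        else:
            merged.append(list(tokens[i]))
            merge_map[i] = len(merged) - 1
    return merged, merge_map


def unmerge(merged, merge_map):
    """Gather-based reconstruction back to the original sequence length."""
    return [merged[k] for k in merge_map]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
merged, merge_map = mutual_pair_merge(tokens)
restored = unmerge(merged, merge_map)
```

Here five tokens shrink to three, and `unmerge` restores a five-token sequence in which each merged pair shares its averaged feature, so a dense segmentation head could consume it unchanged.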

49. 【2604.05715】In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

Link: https://arxiv.org/abs/2604.05715

Authors: Wenhui Xiao, Ethan Goan, Rodrigo Santa Cruz, David Ahmedt-Aristizabal, Olivier Salvado, Clinton Fookes, Leo Lebrat

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mitigate artifacts caused, Gaussian Splatting, sparse training data, textureless surfaces, depth

Comments: accepted to CVPR 3DMV Workshop

Click to view abstract

Abstract:Using accurate depth priors in 3D Gaussian Splatting helps mitigate artifacts caused by sparse training data and textureless surfaces. However, acquiring accurate depth maps requires specialized acquisition systems. Foundation monocular depth estimation models offer a cost-effective alternative, but they suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, which can degrade rendering performance when applied naively. This paper addresses the challenge of reliably leveraging monocular depth priors for Gaussian Splatting (GS) rendering enhancement. To this end, we introduce a training framework integrating scale-ambiguous and noisy depth priors into geometric supervision. We highlight the importance of learning from weakly aligned depth variations. We introduce a method to isolate ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth inaccuracies into well-reconstructed 3D structures. Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones tested.

50. 【2604.05695】Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

Link: https://arxiv.org/abs/2604.05695

Authors: Chongyu Wang, Ting Huang, Chunyu Sun, Xinyu Ning, Di Wang, Hao Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, real-world visual streams, processing real-world visual, achieved remarkable progress, exhibit limited physical

Comments

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.

51. 【2604.05689】CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Link: https://arxiv.org/abs/2604.05689

Authors: Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Feature Flow Transformer, present Consistent-Recurrent Feature, Flow Transformer, Consistent-Recurrent Feature Flow, feature flow learning

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at this https URL.

52. 【2604.05687】3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.05687

Authors: Xinye Zheng, Fei Wang, Yiqi Nie, Kun Li, Junjie Chen, Jiaqi Zhao, Yanyan Wei, Zhiliang Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong scattering effects, introduces strong scattering, scattering effects, cross-view consistency, Gaussian Splatting

Comments

Click to view abstract

Abstract:Reconstructing 3D scenes from smoke-degraded multi-view images is particularly difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency. To address these issues, we propose a framework that integrates visual priors with efficient 3D scene modeling. We employ Nano-Banana-Pro to enhance smoke-degraded images and provide clearer visual observations for reconstruction and develop Smoke-GS, a medium-aware 3D Gaussian Splatting framework for smoke scene reconstruction and restoration-oriented novel view synthesis. Smoke-GS models the scene using explicit 3D Gaussians and introduces a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke. Our method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation. Results demonstrate the effectiveness of our method for generating consistent and visually clear novel views in challenging smoke environments.

53. 【2604.05656】SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

Link: https://arxiv.org/abs/2604.05656

Authors: Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generalist robotic manipulation, ODE steps, introduces substantial latency, inference time, generalist robotic

Comments: 10 pages, 6 figures, 9 tables

Click to view abstract

Abstract:Vision-Language-Action (VLA) models based on flow matching -- such as pi0, pi0.5, and SmolVLA -- achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model's own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success -- matching the 10-step teacher at 97.75% and slightly exceeding it -- with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.

54. 【2604.05651】Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

Link: https://arxiv.org/abs/2604.05651

Authors: Jonas Muth, Zdravko Marinov, Simon Reiß

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remain largely unexplored, computer vision community, medical computer vision, medical vision tasks, tasks

Abstract:While much of the medical computer vision community has focused on advancing performance for specific tasks, the underlying relationships between tasks, i.e., how they relate, overlap, or differ on a representational level, remain largely unexplored. Our work explores these intrinsic relationships between medical vision tasks. Specifically, we investigate 30 tasks, such as semantic tasks (e.g., segmentation and detection), image generative tasks (e.g., denoising, inpainting, or colorization), and image transformation tasks (e.g., geometric transformations). Our goal is to probe whether a data-driven representation space can capture an underlying structure of tasks across a variety of 39 datasets from wildly different medical imaging modalities, including computed tomography, magnetic resonance, electron microscopy, X-ray, ultrasound, and more. By revealing how tasks relate to one another, we aim to provide insights into their fundamental properties and interconnectedness. To this end, we introduce Task-Contrastive Learning (TaCo), a contrastive learning framework designed to embed tasks into a shared representation space. Through TaCo, we map these heterogeneous tasks from different modalities into a joint space and analyze their properties: identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space. Our work provides a foundation for understanding the intrinsic structure of medical vision tasks, offering a deeper understanding of task similarities and their interconnected properties in embedding spaces.

55. 【2604.05649】Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

Link: https://arxiv.org/abs/2604.05649

Authors: Peixi Peng (1), Housheng Xie (1), Yanling Wei (2), Guangcong Ruan (2), Xiaoyang Zou (1), Qian Cao (3), Yongjian Nian (2), Guoyan Zheng (1) ((1) Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, (2) Daping Hospital, Army Medical University, (3) Sir Run Run Shaw Hospital, Zhejiang University School of Medicine)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: global health burden, growing global health, Gastrointestinal diseases impose, health burden, impose a growing

Abstract:Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

56. 【2604.05638】PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

Link: https://arxiv.org/abs/2604.05638

Authors: Ruilin Tang, Yang Zhou, Zhong Ye, Wenxi Liu, Yan Huang, Shengfeng He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding dynamic, Understanding, language queries requires, accurate scene reconstruction, natural language queries

Abstract:Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.

57. 【2604.05636】Towards Athlete Fatigue Assessment from Association Football Videos

Link: https://arxiv.org/abs/2604.05636

Authors: Xavier Bou, Nathan Correger, Alexandre Cloots, Cédric Gavage, Silvio Giancola, Cédric Schwartz, François Delvaux, Rudi Cloots, Marc Van Droogenbroeck, Anthony Cioppa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: association football due, monitoring is central, central in association, association football, football due

Abstract:Fatigue monitoring is central in association football due to its links with injury risk and tactical performance. However, objective fatigue-related indicators are commonly derived from subjective self-reported metrics, biomarkers derived from laboratory tests, or, more recently, intrusive sensors such as heart monitors or GPS tracking data. This paper studies whether monocular broadcast videos can provide spatio-temporal signals of sufficient quality to support fatigue-oriented analysis. Building on state-of-the-art Game State Reconstruction methods, we extract player trajectories in pitch coordinates and propose a novel kinematics processing algorithm to obtain temporally consistent speed and acceleration estimates from reconstructed tracks. We then construct acceleration--speed (A-S) profiles from these signals and analyze their behavior as fatigue-related performance indicators. We evaluate the full pipeline on the public SoccerNet-GSR benchmark, considering both 30-second clips and a complete 45-minute half to examine short-term reliability and longer-term temporal consistency. Our results indicate that monocular GSR can recover kinematic patterns that are compatible with A-S profiling while also revealing sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. These findings support monocular broadcast video as a low-cost basis for fatigue analysis and delineate the methodological challenges for future research.
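To make the acceleration-speed (A-S) profile concrete, here is a minimal sketch that derives speed and acceleration from a track in pitch coordinates. Plain finite differences are used as a stand-in; the abstract does not describe the paper's kinematics processing algorithm, so the function names, the frame rate, and the differencing scheme are all assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position on the pitch, in metres


def speeds(track: List[Point], fps: float) -> List[float]:
    """Speed between consecutive frames (m/s), by forward differences."""
    return [
        math.hypot(x1 - x0, y1 - y0) * fps
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]


def accelerations(v: List[float], fps: float) -> List[float]:
    """Rate of change of speed between consecutive speed samples (m/s^2)."""
    return [(b - a) * fps for a, b in zip(v, v[1:])]


def as_profile(track: List[Point], fps: float) -> List[Tuple[float, float]]:
    """Pair each acceleration with the speed reached at the end of its interval."""
    v = speeds(track, fps)
    a = accelerations(v, fps)
    return list(zip(v[1:], a))


# Synthetic track: constant 2 m/s^2 acceleration along x, sampled at 10 fps.
fps = 10.0
track = [(0.5 * 2.0 * (i / fps) ** 2, 0.0) for i in range(5)]
profile = as_profile(track, fps)
print(profile)
```

On real broadcast-derived tracks, differencing amplifies trajectory noise, which is the kind of sensitivity the abstract reports; some smoothing before differencing would normally be needed.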

58. 【2604.05632】SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection

Link: https://arxiv.org/abs/2604.05632

Authors: Letian Bai, Chengyu Tao, Juan Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: identify surface defects, aims to identify, identify surface, surface defects, defects on complex

Abstract:Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.

59. 【2604.05629】A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting

Link: https://arxiv.org/abs/2604.05629

Authors: Yongchuan Cui, Peng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sensing imagery suffers, Remote sensing imagery, Large-scale Remote Sensing, Remote sensing, Remote Sensing restoration

Abstract:Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: this https URL
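For readers unfamiliar with Sinkhorn-Knopp optimal transport, the sketch below shows the generic entropy-regularized algorithm on a tiny band-to-slot cost matrix. It illustrates the technique only: the cost values, the uniform marginals, and the regularization strength `eps` are hypothetical, and nothing here is taken from the LLaRS implementation.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], eps: float = 0.5, iters: int = 200) -> List[List[float]]:
    """Entropy-regularized transport plan with uniform row and column marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs: 3 sensor bands (rows) versus 2 semantic slots (columns).
cost = [[0.1, 0.9],
        [0.8, 0.2],
        [0.5, 0.5]]
plan = sinkhorn(cost)
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of `plan` is a soft assignment of one band across the slots; lower-cost pairs receive more mass, while the marginal constraints keep any slot from absorbing every band.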

60. 【2604.05623】DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Link: https://arxiv.org/abs/2604.05623

Authors: Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: ensuring high reliability, Accurately detecting, Multimodal Large Language, Large Language Models, detecting and localizing

Comments: 8 pages, 5 figures. The dataset and code are available at [this https URL](https://zyx-hhnkh.github.io/DetailVerifyBench/)

Abstract:Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at this https URL.

61. 【2604.05621】FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

Link: https://arxiv.org/abs/2604.05621

Authors: Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: RGB-D interaction videos, egocentric RGB-D interaction, egocentric RGB-D, indoor scenes directly, reconstructing functional

Comments: CVPR 2026. Project page: [this https URL](https://functionalscenes.github.io)

Abstract:We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.

62. 【2604.05620】Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

Link: https://arxiv.org/abs/2604.05620

Authors: Chenyu Xue, Yiran Liu, Mian Zhou, Jionglong Su, Zhixiang Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: image segmentation driven, Medical image segmentation, free-text clinical instructions, computer-aided diagnosis, image segmentation

Abstract:Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.

63. 【2604.05616】Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization

Link: https://arxiv.org/abs/2604.05616

Authors: Dustin Eisenhardt, Timothy Schaumlöffel, Alperen Kantarci, Gemma Roig

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Deep learning models, synthetic data due, Deep learning, real-world settings, learning models

Abstract:Deep learning models for computer vision often suffer from poor generalization when deployed in real-world settings, especially when trained on synthetic data due to the well-known Sim2Real gap. Despite the growing popularity of style transfer as a data augmentation strategy for domain generalization, the literature contains unresolved contradictions regarding three key design axes: the diversity of the style pool, the role of texture complexity, and the choice of style source. We present a systematic empirical study that isolates and evaluates each of these factors for driving scene understanding, resolving inconsistencies in prior work. Our findings show that (i) expanding the style pool yields larger gains than repeated augmentation with few styles, (ii) texture complexity has no significant effect when the pool is sufficiently large, and (iii) diverse artistic styles outperform domain-aligned alternatives. Guided by these insights, we derive StyleMixDG (Style-Mixing for Domain Generalization), a lightweight, model-agnostic augmentation recipe that requires no architectural modifications or additional losses. Evaluated on the GTAV $\rightarrow$ {BDD100k, Cityscapes, Mapillary Vistas} benchmark, StyleMixDG demonstrates consistent improvements over strong baselines, confirming that the empirically identified design principles translate into practical gains. The code will be released on GitHub.

64. 【2604.05605】INTERACT: An AI-Driven Extended Reality Framework for Accessible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition

Link: https://arxiv.org/abs/2604.05605

Authors: Nikolaos D. Tantaroudas, Andrew J. McCracken, Ilias Karachalios, Evangelos Papatheou

Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: offer limited support, World Health Organisation, Video conferencing, platforms offer limited, Health Organisation estimates

Comments: 20

Abstract:Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organisation estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain constrained by high costs, limited availability, and logistical barriers, while Extended Reality (XR) technologies open new possibilities for immersive and inclusive communication. This paper presents INTERACT (Inclusive Networking for Translation and Embodied Real-Time Augmented Communication Tool), an AI-driven XR platform that integrates real-time speech-to-text conversion, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition within an immersive virtual environment. Built on the CORTEX2 framework and deployed on Meta Quest 3 headsets, INTERACT combines Whisper for speech recognition, NLLB for multilingual translation, RoBERTa for emotion classification, and Google MediaPipe for gesture extraction. Pilot evaluations were conducted in two phases, first with technical experts from academia and industry, and subsequently with members of the deaf community. The trials reported 92% user satisfaction, transcription accuracy above 85%, and 90% emotion-detection precision, with a mean overall experience rating of 4.6 out of 5.0 and 90% of participants willing to take part in further testing. The results highlight strong potential for advancing accessibility across educational, cultural, and professional settings. An extended version of this work, including full pilot data and implementation details, has been published as an Open Research Europe article [Tantaroudas et al., 2026a].

65. 【2604.05601】ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

Link: https://arxiv.org/abs/2604.05601

Authors: Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision-language models, Recent advances, vision-language models, explored visual token, advances have explored

Abstract:Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
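The selection loop described in the abstract (score every token, then repeatedly take the top token and suppress the scores of similar ones) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity as the similarity measure and the multiplicative rule `score *= 1 - similarity` are hypothetical choices, since the abstract does not specify the suppression function.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_tokens(features: List[Sequence[float]],
                  importance: Sequence[float],
                  k: int) -> List[int]:
    """Greedily pick k token indices, damping scores of tokens similar to each pick."""
    scores = list(importance)
    remaining = set(range(len(features)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=lambda i: scores[i])
        chosen.append(best)
        remaining.remove(best)
        for j in remaining:
            sim = max(0.0, cosine(features[best], features[j]))
            scores[j] *= 1.0 - sim  # hypothetical suppression rule
    return chosen


# Tokens 0 and 1 are near-duplicates; token 2 is distinct but less important.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
importance = [0.9, 0.8, 0.5]
picked = select_tokens(features, importance, k=2)
print(picked)
```

Ranking by importance alone would keep tokens 0 and 1, which carry nearly the same information. The suppression step lets the less important but distinct token 2 take the second slot.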

66. 【2604.05595】Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

Link: https://arxiv.org/abs/2604.05595

Authors: Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, models have achieved, achieved remarkable, robotic manipulation

Abstract:Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored concern that poses a significant safety risk to real-world deployment. Red teaming, or identifying environmental scenarios that elicit catastrophic behaviors, is an important step in ensuring the safe deployment of embodied AI agents. Reinforcement learning (RL) has emerged as a promising approach in automated red teaming that aims to uncover these vulnerabilities. However, standard RL-based adversaries often suffer from severe mode collapse due to their reward-maximizing nature, which tends to converge to a narrow set of trivial or repetitive failure patterns, failing to reveal the comprehensive landscape of meaningful risks. To bridge this gap, we propose a novel Diversity-Aware Embodied Red Teaming (DAERT) framework to expose the vulnerabilities of VLAs against linguistic variations. Our design is based on evaluating a uniform policy, which is able to generate a diverse set of challenging instructions while ensuring its attack effectiveness, measured by execution failures in a physical simulator. We conduct extensive experiments across different robotic benchmarks against two state-of-the-art VLAs, including $\pi_0$ and OpenVLA. Our method consistently discovers a wider range of more effective adversarial instructions that reduce the average task success rate from 93.33% to 5.85%, demonstrating a scalable approach to stress-testing VLA agents and exposing critical safety blind spots before real-world deployment.

67. 【2604.05594】BPC-Net: Annotation-Free Skin Lesion Segmentation via Boundary Probability Calibration

Link: https://arxiv.org/abs/2604.05594

Authors: Yujie Yao, Yuhaohang He, Junjie Huang, Zhou Liu, Jiangzhao Li, Yan Qiao, Wen Xiao, Yunsen Liang, Xiaofan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low-resource dermoscopic deployment, dermoscopic deployment, attractive for low-resource, low-resource dermoscopic, Annotation-free skin lesion

Abstract:Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path. Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.

68. 【2604.05584】Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher

Link: https://arxiv.org/abs/2604.05584

Authors: Pengcheng Weng (1,2), Yanyu Qian (1,3), Yangxin Xu (1), Fei Wang (1) ((1) School of Software Engineering, Xi'an Jiaotong University, China, (2) Institute of Computer Science, University of Bern, Switzerland, (3) College of Computing and Data Science, Nanyang Technological University, Singapore)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robust multimodal human, multimodal human sensing, Robust multimodal, multimodal human, human sensing

Comments: Accepted by CVPR 2026 Workshop On Any-to-Any Multimodal Learning

Abstract:Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel "Purify-then-Align" framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this "Purify-then-Align" strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.

69. 【2604.05583】WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

Link: https://arxiv.org/abs/2604.05583

Authors: Yizhuo Xu, Chaojian Yu, Yuanjie Shao, Tongliang Liu, Qinmu Peng, Xinge You

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Composed Image Retrieval, target images based, retrieve target images, Composed Image, Image Retrieval

Abstract:Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
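The idea of perturbing weights "in the opposite direction of gradient descent" before computing the update can be shown on a one-parameter regression. This is a toy sketch of generic adversarial weight perturbation, not the WRF4CIR network: the model `y = w * x`, the perturbation radius `rho`, and the learning rate are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def mse_and_grad(w: float, xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Mean squared error of y = w * x and its derivative with respect to w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


def perturbed_step(w: float, xs: List[float], ys: List[float],
                   lr: float = 0.05, rho: float = 0.05) -> float:
    """One update using the gradient taken at adversarially perturbed weights."""
    _, g = mse_and_grad(w, xs, ys)
    # Move the weight uphill (against descent) by a fixed radius rho.
    w_adv = w + rho * (g / abs(g)) if g != 0.0 else w
    # The gradient at the harder, perturbed point drives the real update.
    _, g_adv = mse_and_grad(w_adv, xs, ys)
    return w - lr * g_adv


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # generated by w = 2
w = 0.0
for _ in range(50):
    w = perturbed_step(w, xs, ys)
print(round(w, 3))
```

Because every update is computed at a slightly worse point than the current weights, the optimizer cannot settle exactly on the training minimum; it hovers in a small band around it, which is the regularizing effect this family of methods relies on.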

70. 【2604.05581】High-Resolution Single-Shot Polarimetric Imaging Made Easy

Link: https://arxiv.org/abs/2604.05581

Authors: Shuangfan Zhou, Chu Zhou, Heng Guo, Youwei Lyu, Boxin Shi, Zhanyu Ma, Imari Sato

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained increasing attention, Polarization-based vision, providing richer physical, richer physical cues, vision has gained

Abstract:Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrificing the snapshot capability, we propose EasyPolar, a multi-view polarimetric imaging framework. Our system is grounded in the physical insight that three independent intensity measurements are sufficient to fully characterize linear polarization. Guided by this, we design a triple-camera setup consisting of three synchronized RGB cameras that capture one unpolarized view and two polarized views with distinct orientations. Building upon this hardware design, we further propose a confidence-guided polarization reconstruction network to address the potential misalignment in multi-view fusion. The network performs multi-modal feature fusion under a confidence-aware physical guidance mechanism, which effectively suppresses warping-induced artifacts and enforces explicit geometric constraints on the solution space. Experimental results demonstrate that our method achieves high-quality results and benefits various downstream tasks.
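The claim that three intensity measurements determine linear polarization can be checked with a short calculation. The sketch assumes ideal polarizers, for which the intensity behind a polarizer at angle θ is I(θ) = ½(S0 + S1·cos 2θ + S2·sin 2θ), and assumes the unpolarized view records S0 on the same scale. Real cameras would need radiometric calibration and view alignment, which this example ignores.

```python
import math
from typing import Tuple


def linear_stokes(i_unpol: float,
                  i1: float, theta1: float,
                  i2: float, theta2: float) -> Tuple[float, float, float]:
    """Recover (S0, S1, S2) from one unpolarized and two polarized intensities."""
    s0 = i_unpol
    a11, a12 = math.cos(2 * theta1), math.sin(2 * theta1)
    a21, a22 = math.cos(2 * theta2), math.sin(2 * theta2)
    b1, b2 = 2 * i1 - s0, 2 * i2 - s0
    det = a11 * a22 - a12 * a21  # equals sin(2 * (theta2 - theta1))
    if abs(det) < 1e-12:
        raise ValueError("polarizer angles must differ by other than a multiple of 90 degrees")
    s1 = (b1 * a22 - a12 * b2) / det
    s2 = (a11 * b2 - a21 * b1) / det
    return s0, s1, s2


def dolp_aolp(s0: float, s1: float, s2: float) -> Tuple[float, float]:
    """Degree and angle (radians) of linear polarization."""
    return math.hypot(s1, s2) / s0, 0.5 * math.atan2(s2, s1)


# Synthetic pixel with known Stokes parameters, polarizers at 0 and 45 degrees.
true_s = (1.0, 0.3, -0.2)
i_0 = 0.5 * (true_s[0] + true_s[1])    # theta = 0
i_45 = 0.5 * (true_s[0] + true_s[2])   # theta = pi / 4
recovered = linear_stokes(true_s[0], i_0, 0.0, i_45, math.pi / 4)
print(recovered, dolp_aolp(*recovered))
```

The two polarized views contribute two linear equations in S1 and S2. These are solvable whenever the polarizer orientations are distinct and not 90° apart, which matches the abstract's requirement of "distinct orientations".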

71. 【2604.05562】Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection

链接https://arxiv.org/abs/2604.05562

作者:Luqi Gong,Qixin Xie,Yue Chen,Ziqiang Chen,Fanda Fan,Shuai Zhao,Chao Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Meta-learning facilitates few-shot, backbones remains challenging, adapting deep backbones, deep backbones remains, facilitates few-shot hyperspectral

备注

点击查看摘要

Abstract:Meta-learning facilitates few-shot hyperspectral target detection (HTD), but adapting deep backbones remains challenging. Full-parameter fine-tuning is inefficient and prone to overfitting, and existing methods largely ignore the frequency-domain structure and spectral band continuity of hyperspectral data, limiting spectral adaptation and cross-domain generalization. To address these challenges, we propose SpecMamba, a parameter-efficient and frequency-aware framework that decouples stable semantic representation from agile spectral adaptation. Specifically, we introduce a Discrete Cosine Transform Mamba Adapter (DCTMA) on top of frozen Transformer representations. By projecting spectral features into the frequency domain via DCT and leveraging Mamba's linear-complexity state-space recursion, DCTMA explicitly captures global spectral dependencies and band continuity while avoiding the redundancy of full fine-tuning. Furthermore, to address prototype drift caused by limited sample sizes, we design a Prior-Guided Tri-Encoder (PGTE) that allows laboratory spectral priors to guide the optimization of the learnable adapter without disrupting the stable semantic feature space. Finally, a Self-Supervised Pseudo-Label Mapping (SSPLM) strategy is developed for test-time adaptation, enabling efficient decision boundary refinement through uncertainty-aware sampling and dual-path consistency constraints. Extensive experiments on multiple public datasets demonstrate that SpecMamba consistently outperforms state-of-the-art methods in detection accuracy and cross-domain generalization.

72. 【2604.05558】Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities

链接https://arxiv.org/abs/2604.05558

作者:Rongfei Chen,Tingting Zhang,Xiaoyu Shen,Wei Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real world scenarios, significantly degrading model, degrading model accuracy, modality problem poses, multimodal sentiment analysis

备注: 6 pages, 3 figures, conference

点击查看摘要

Abstract:The missing-modality problem poses a fundamental challenge in multimodal sentiment analysis, significantly degrading model accuracy and generalization in real-world scenarios. Existing approaches primarily improve robustness through prompt learning and pre-trained models. However, two limitations remain. First, the necessity of generating missing modalities lacks rigorous evaluation. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored. To address these issues, a Prompt-based Missing Modality Adaptation framework is proposed. A Missing Modality Evaluator is introduced at the input stage to dynamically assess the importance of missing modalities using pre-trained models and pseudo-labels, thereby avoiding low-quality data imputation. Building on this, a Modality-invariant Prompt Disentanglement module decomposes shared prompts into modality-specific private prompts to capture intrinsic local correlations and improve representation quality. In addition, a Dynamic Prompt Weighting module computes mutual-information-based weights from cross-attention outputs to adaptively suppress interference from missing modalities. To enhance global consistency, a Multi-level Prompt Dynamic Connection module integrates shared prompts with self-attention outputs through residual connections, leveraging global prompt priors to strengthen key guidance features. Extensive experiments on three public benchmarks, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, demonstrate that the proposed framework achieves state-of-the-art performance and stable results under diverse missing-modality settings. The implementation is available at this https URL.

73. 【2604.05544】Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

链接https://arxiv.org/abs/2604.05544

作者:Jiahua Ma,Yiran Qin,Xin Wen,Yixiong Li,Yuyu Sun,Yulan Guo,Liang Lin,Ruimao Zhang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:visuomotor policy learning, dynamically re-routing trajectories, model relies solely, Referring-Aware Visuomotor Policy, visuomotor policy

备注

点击查看摘要

Abstract:This paper addresses a fundamental problem of visuomotor policy learning for robotic manipulation: how to enhance robustness to out-of-distribution execution errors and dynamic trajectory re-routing when the model relies solely on the original expert demonstrations for training. We introduce the Referring-Aware Visuomotor Policy (ReV), a closed-loop framework that can adapt to unforeseen circumstances by instantly incorporating sparse referring points provided by a human or a high-level reasoning planner. Specifically, ReV leverages coupled diffusion heads to preserve standard task execution patterns while seamlessly integrating sparse referring points via a trajectory-steering strategy. Upon receiving a specific referring point, the global diffusion head first generates a sequence of globally consistent yet temporally sparse action anchors, while identifying the precise temporal position for the referring point within this sequence. Subsequently, the local diffusion head adaptively interpolates adjacent anchors based on the current temporal position for specific tasks. This closed-loop process repeats at every execution step, enabling real-time trajectory replanning in response to dynamic changes in the scene. In practice, rather than relying on elaborate annotations, ReV is trained only by applying targeted perturbations to expert demonstrations. Without any additional data or fine-tuning scheme, ReV achieves higher success rates across challenging simulated and real-world tasks.

74. 【2604.05541】EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds"

链接https://arxiv.org/abs/2604.05541

作者:Qin Wang,Zhiqing He,Yu Liu,Bowen Guo,Zeju Li,Miao Zhao,Wenhao Ju,Zhiling Luo,Xianhong Shu,Yi Guo,Yuanyuan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including visual observation, orchestrate multiple capabilities, synchronously orchestrate multiple, assessing cardiac function, expert knowledge learning

备注: Accepted by CVPR 2026 CV4Clinical, 11 pages, 6 figures

点击查看摘要

Abstract:Reliable interpretation of echocardiography (Echo) is crucial for assessing cardiac function, which requires clinicians to orchestrate multiple capabilities simultaneously, including visual observation (eyes), manual measurement (hands), and expert knowledge learning and reasoning (minds). While current task-specific deep-learning approaches and multimodal large language models have demonstrated promise in assisting Echo analysis through automated segmentation or reasoning, they remain focused on restricted skills, i.e., eyes-hands or eyes-minds, thereby limiting clinical reliability and utility. To address these issues, we propose EchoAgent, an agentic system tailored for end-to-end Echo interpretation, which achieves a fully coordinated eyes-hands-minds workflow that learns, observes, operates, and reasons like a cardiac sonographer. First, we introduce an expertise-driven cognition engine where our agent can automatically assimilate credible Echo guidelines into a structured knowledge base, thus constructing an Echo-customized mind. Second, we devise a hierarchical collaboration toolkit to endow EchoAgent with eyes-hands, which can automatically parse Echo video streams, identify cardiac views, perform anatomical segmentation, and take quantitative measurements. Third, we integrate the perceived multimodal evidence with the exclusive knowledge base into an orchestrated reasoning hub to conduct explainable inferences. We evaluate EchoAgent on CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views spanning 14 cardiac anatomical regions. Experimental results show that EchoAgent achieves optimal performance across diverse structure analyses, yielding overall accuracy of up to 80.00%. Importantly, EchoAgent empowers a single system with abilities to learn, observe, operate and reason like an echocardiologist, which holds great promise for reliable Echo interpretation.

75. 【2604.05527】Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

链接https://arxiv.org/abs/2604.05527

作者:Xuanguang Liu,Lei Ding,Yujie Li,Chenguang Dai,Zhenchao Zhang,Mengmeng Li,Ziyi Yang,Yifan Sun,Yongqi Sun,Hanyun Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:identifies changed areas, demonstrating significant application, urban sustainable development, multimodal remote sensing, disaster assessment

备注

点击查看摘要

Abstract:Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, existing MMCD approaches exhibit limitations in cross-modal interaction and in exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundation models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: this https URL.

76. 【2604.05524】Cross-Resolution Diffusion Models via Network Pruning

链接https://arxiv.org/abs/2604.05524

作者:Jiaxuan Ren,Junhan Zhu,Huan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:impressive image synthesis, demonstrated impressive image, demonstrated impressive, image synthesis performance, impressive image

备注: Accepted by CVPR Findings 2026

点击查看摘要

Abstract:Diffusion models have demonstrated impressive image synthesis performance, yet many UNet-based models are trained at certain fixed resolutions. Their quality tends to degrade when generating images at out-of-training resolutions. We trace this issue to resolution-dependent parameter behaviors, where weights that function well at the default resolution can become adverse when spatial scales shift, weakening semantic alignment and causing structural instability in the UNet architecture. Based on this analysis, this paper introduces CR-Diff, a novel method that improves the cross-resolution visual consistency by pruning some parameters of the diffusion model. Specifically, CR-Diff has two stages. It first performs block-wise pruning to selectively eliminate adverse weights. Then, a pruned output amplification is conducted to further purify the pruned predictions. Empirically, extensive experiments suggest that CR-Diff can improve perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions, while largely preserving the performance at default resolutions. Additionally, CR-Diff supports prompt-specific refinement, enabling quality enhancement on demand.

77. 【2604.05515】Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation

链接https://arxiv.org/abs/2604.05515

作者:Chenxin Yuan,Shoupeng Chen,Haojiang Ye,Yiming Miao,Limei Peng,Pin-Han Ho

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Tri-directional Dynamic Nonvoid, Nonvoid Voxel Transformer, Dynamic Nonvoid Voxel, Accurate segmentation, treatment planning

备注: 20 pages, 13 figures, supplementary material included, submitted to Medical Image Analysis

点击查看摘要

Abstract:Accurate segmentation of 3D medical scans is crucial for clinical diagnostics and treatment planning, yet existing methods often fail to achieve both high accuracy and computational efficiency across diverse anatomies and imaging modalities. To address these challenges, we propose GCNV-Net, a novel 3D medical segmentation framework that integrates a Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), a Geometrical Cross-Attention module (GCA), and Nonvoid Voxelization. The 3DNVT dynamically partitions relevant voxels along the three orthogonal anatomical planes, namely the transverse, sagittal, and coronal planes, enabling effective modeling of complex 3D spatial dependencies. The GCA mechanism explicitly incorporates geometric positional information during multi-scale feature fusion, significantly enhancing fine-grained anatomical segmentation accuracy. Meanwhile, Nonvoid Voxelization processes only informative regions, greatly reducing redundant computation without compromising segmentation quality, and achieves a 56.13% reduction in FLOPs and a 68.49% reduction in inference latency compared to conventional voxelization. We evaluate GCNV-Net on multiple widely used benchmarks: BraTS2021, ACDC, MSD Prostate, MSD Pancreas, and AMOS2022. Our method achieves state-of-the-art segmentation performance across all datasets, outperforming the best existing methods by 0.65% on Dice, 0.63% on IoU, 1% on NSD, and by a relative 14.5% on HD95. All results demonstrate that GCNV-Net effectively balances accuracy and efficiency, and its robustness across diverse organs, disease conditions, and imaging modalities highlights strong potential for clinical deployment.
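
For readers unfamiliar with the overlap metrics quoted above, here is a minimal sketch of Dice and IoU on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two equal-length sequences of 0/1 labels (flattened masks)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), which is why gains reported on one usually track gains on the other.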

78. 【2604.05510】Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

链接https://arxiv.org/abs/2604.05510

作者:Yanming Xiu,Zhengayuan Jiang,Neil Zhenqiang Gong,Maria Gorlatova

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Augmented reality, past decade, rapidly expanded, Augmented, virtual content

备注: CVPR 2026 Findings

点击查看摘要

Abstract:Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

79. 【2604.05500】CLIP-Guided Data Augmentation for Night-Time Image Dehazing

链接https://arxiv.org/abs/2604.05500

作者:Xining Ge,Weijun Yuan,Gengjia Chang,Xuyang Li,Shuhong Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:strong light interference, haze scattering couples, image dehazing faces, Nighttime image dehazing, non-uniform lighting

备注

点击查看摘要

Abstract:Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.
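
The x8 self-ensemble mentioned above is a common test-time trick: run the model on the eight flip/rotation variants of the input, undo each transform on the output, and average. The sketch below uses plain nested lists and a hypothetical `model` callable. It does not reproduce the challenge entry's NAFNet, TLC, or snapshot-fusion steps.

```python
def rot90(img):
    """Rotate a 2D list-of-lists image by 90 degrees."""
    return [list(row) for row in zip(*img)][::-1]


def hflip(img):
    """Mirror a 2D list-of-lists image left to right."""
    return [row[::-1] for row in img]


def self_ensemble_x8(model, img):
    """Average model outputs over 4 rotations x 2 flips, mapped back to the input frame."""
    total = None
    for k in range(4):
        for flip in (False, True):
            x = img
            for _ in range(k):
                x = rot90(x)
            if flip:
                x = hflip(x)
            y = model(x)
            # Undo the transforms in reverse order.
            if flip:
                y = hflip(y)
            for _ in range((4 - k) % 4):
                y = rot90(y)
            if total is None:
                total = [[0.0] * len(y[0]) for _ in y]
            for r, row in enumerate(y):
                for c, v in enumerate(row):
                    total[r][c] += v
    return [[v / 8.0 for v in row] for row in total]


# With an identity "model", the ensemble returns the input unchanged.
restored = self_ensemble_x8(lambda x: x, [[1, 2, 3], [4, 5, 6]])
```

For a model that is not perfectly equivariant to flips and rotations, the eight predictions differ slightly, and averaging them tends to stabilise the output at the cost of eight forward passes.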

80. 【2604.05497】Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

链接https://arxiv.org/abs/2604.05497

作者:Keuntae Kim,Mingyu Kang,Yong Suk Choi

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, Diffusion large language, large language, multimodal large language, diffusion multimodal large

备注: CVPR 2026 - main

点击查看摘要

Abstract:Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model's alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.
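
The abstract says VRG is inspired by classifier-free guidance. As a rough illustration, the standard classifier-free guidance combination can be applied to token logits computed with and without the image. The function name, the use of raw logits, and the `scale` parameter are assumptions, and the paper's exact formulation of VRG (and of PSP) is not given in the abstract.

```python
def guided_logits(logits_with_image, logits_without_image, scale):
    """Classifier-free-guidance-style mix: u + scale * (c - u), element by element.

    scale = 1 reproduces the image-conditioned logits; scale > 1 amplifies
    whatever the visual input changed relative to the text-only prediction.
    """
    return [
        u + scale * (c - u)
        for c, u in zip(logits_with_image, logits_without_image)
    ]


mixed = guided_logits([2.0, 0.0], [1.0, 1.0], scale=2.0)
```

Tokens whose score rises when the image is present are pushed up further, and tokens the image argues against are pushed down, which is one way to strengthen dependence on visual evidence.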

81. 【2604.05490】A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures

链接https://arxiv.org/abs/2604.05490

作者:Wenbo Zhang,Zekun Long,Zican Liu,Yangchen Zeng,Keyi Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Ground Penetrating Radar, Subsurface defect detection, high wavefield similarity, Ground Penetrating, Penetrating Radar

备注: 8 pages, 7 figures, 5 tables. Accepted by International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:Subsurface defect detection via Ground Penetrating Radar is challenged by "weak signals": faint diffraction hyperbolas with low signal-to-clutter ratios, high wavefield similarity, and geometric degradation. Existing lightweight detectors prioritize efficiency over sensitivity, failing to preserve low-frequency structures or decouple heterogeneous clutter. We propose WSA-Net, a framework designed to enhance faint signatures through physical-feature reconstruction. Moving beyond simple parameter reduction, WSA-Net integrates four mechanisms: signal preservation using partial convolutions; clutter suppression via heterogeneous grouping attention; geometric reconstruction to sharpen hyperbolic arcs; and context anchoring to resolve semantic ambiguities. Evaluations on the RTST dataset show WSA-Net achieves 0.6958 mAP@0.5 and 164 FPS with only 2.412 M parameters. Results show that signal-centric awareness in lightweight architectures effectively reduces false negatives in infrastructure inspection.

82. 【2604.05484】CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

链接https://arxiv.org/abs/2604.05484

作者:Li Kang,Yutao Fan,Rui Li,Heng Zhou,Yiran Qin,Zhemeng Zhang,Songtao Huang,Xiufeng Song,Zaibin Zhang,Bruno N.Y. Chen,Zhenfei Yin,Dongzhan Zhou,Wangmeng Zuo,Lei Bai

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:systems hold promise, face critical challenges, shared workspace awareness, embodied systems hold, complex collaborative manipulation

备注: 31 pages, 8 figures, including supplementary material. Project page: [this https URL](https://faceong.github.io/CoEnv/)

点击查看摘要

Abstract:Multi-agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment -- a synergistic integration of real-world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision-making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real-world deployment. CoEnv operates through three stages: real-to-sim scene reconstruction that digitizes physical workspaces, VLM-driven action synthesis supporting both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and validated sim-to-real transfer with collision detection for safe deployment. Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv's effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi-agent embodied AI.

83. 【2604.05482】Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

链接https://arxiv.org/abs/2604.05482

作者:Pu Wang,Zhixuan Mao,Jialu Li,Zhuoran Zheng,Dianjie Lu,Youshan Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Automatic diagnosis, diagnosis of canine, challenged by data, data scarcity, Flow Matching

备注

点击查看摘要

Abstract:Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: this https URL).
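
The abstract does not say which Random Matrix Theory result is used. One common choice for "outlier eigenvalues above a noise bulk" is the Marchenko-Pastur upper edge, sketched below purely as an assumption. For a sample covariance matrix built from `n_samples` observations of `n_features` independent noise variables with variance `sigma2`, eigenvalues concentrate below sigma2 · (1 + sqrt(n_features / n_samples))².

```python
import math


def mp_upper_edge(n_samples, n_features, sigma2=1.0):
    """Marchenko-Pastur upper bulk edge for a pure-noise sample covariance matrix."""
    ratio = n_features / n_samples
    return sigma2 * (1.0 + math.sqrt(ratio)) ** 2


def outlier_eigenvalues(eigenvalues, n_samples, n_features, sigma2=1.0):
    """Return the eigenvalues that exceed the noise edge (candidate signal components)."""
    edge = mp_upper_edge(n_samples, n_features, sigma2)
    return [ev for ev in eigenvalues if ev > edge]


edge = mp_upper_edge(400, 100)
signal = outlier_eigenvalues([0.5, 1.0, 2.2, 3.1], 400, 100)
```

In practice the edge is only asymptotic, so a finite-sample margin is usually added before declaring an eigenvalue significant. The eigenvalues here are hand-picked placeholders, not data from the paper.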

84. 【2604.05475】A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

链接https://arxiv.org/abs/2604.05475

作者:Kidus Zewde,Yuchen Zhou,Dennis Ng,Neo Tiangratanakul,Tommy Duong,Ankit Raj,Yuxin Zhang,Xingyu Shen,Simiao Ren

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large vision-language models, fundamental asymmetry persists, achieved remarkable capabilities, leverage self-supervised pretraining, massive internet-scale data

备注: Synthetic eye movement dataset generation via 3D eye simulator; iris trajectory replay; script reading detection; behavioral data augmentation

点击查看摘要

Abstract:Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data -- gestures, eye movements, social signals -- remains scarce, expensive to annotate, and privacy-sensitive. A promising alternative is simulation: replace real data collection with controlled synthetic generation to produce automatically labeled data at scale. We introduce infrastructure for this paradigm applied to eye movement, a behavioral signal with applications across vision-language modeling, virtual reality, robotics, accessibility systems, and cognitive science. We present a pipeline for generating synthetic labeled eye movement video by extracting real human iris trajectories from reference videos and replaying them on a 3D eye movement simulator via headless browser automation. Applying this to the task of script-reading detection during video interviews, we release final_dataset_v1: 144 sessions (72 reading, 72 conversation) totaling 12 hours of synthetic eye movement video at 25fps. Evaluation shows that generated trajectories preserve the temporal dynamics of the source data (KS D 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement -- a finding that informs future simulator design. The pipeline, dataset, and evaluation tools are released to support downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems.
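
The "KS D" figure in this abstract refers to the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical cumulative distribution functions. A minimal pure-Python sketch follows; the function name and the toy samples are assumptions, and the authors' evaluation tools may compute it differently (for example with a statistics library).

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both ECDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 0 means the two distributions are nearly indistinguishable by this measure, and a value of 1 means the samples do not overlap at all.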

85. 【2604.05449】Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning

链接https://arxiv.org/abs/2604.05449

作者:Kang Ding,Hongsong Wang,Jie Gui,Lei He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unified representation space, dynamic multi-agent game, autonomous driving resides, representation space, integration of perception

备注: 14 pages, 5 figures

点击查看摘要

Abstract:The essence of end-to-end autonomous driving lies not in the integration of perception and planning, but rather in the dynamic multi-agent game within a unified representation space. Most existing end-to-end models treat all agents equally, hindering the decoupling of real collision threats from complex backgrounds. To address this issue, we introduce the concept of Risk-Prioritized Game Planning and propose GameAD, a novel framework that models end-to-end autonomous driving as a risk-aware game problem. GameAD integrates Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization to enable game-theoretic decision making with risk-prioritized interactions. We also present the Planning Risk Exposure metric, which quantifies the cumulative risk intensity of planned trajectories over a long horizon for safe autonomous driving. Extensive experiments on the nuScenes and Bench2Drive datasets show that our approach significantly outperforms state-of-the-art methods, especially in terms of trajectory safety.

86. 【2604.05445】Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

链接https://arxiv.org/abs/2604.05445

作者:Qiyuan Chen,Hongsen Huang,Jiahe Chen,Qian Shao,Jintai Chen,Hongxia Xu,Renjie Hua,Chuan Ren,Jian Wu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:reward modeling faces, Vision-language reward modeling, black boxes, faces a dilemma, generative approaches

备注: ACL 2026 Main

点击查看摘要

Abstract:Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.

87. 【2604.05436】Human Interaction-Aware 3D Reconstruction from a Single Image

链接https://arxiv.org/abs/2604.05436

作者:Gwanghyun Kim,Junghun James Kim,Suh Yoon Jeon,Jason Park,Se Young Chun

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Reconstructing textured, digital human applications, human applications, Reconstructing, digital human

备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image. Project page: this https URL

88. 【2604.05433】Few-Shot Semantic Segmentation Meets SAM3

Link: https://arxiv.org/abs/2604.05433

Authors: Yi-Jen Tsai, Yen-Yu Lin, Chien-Yao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-Shot Semantic Segmentation, Semantic Segmentation, focuses on segmenting, segmenting novel object, object categories

Comments: 14 pages, 3 figures

Abstract:Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: this https URL
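
The "shared canvas" idea in this abstract can be illustrated with a small sketch. The layout here (support on the left, query on the right, zero padding to a common height) is an assumption, and `make_shared_canvas` and `crop_query_mask` are hypothetical helper names. The call to the frozen segmentation model is left out, since the abstract does not describe its interface.

```python
import numpy as np

def make_shared_canvas(support, query):
    """Place the support image (left) and query image (right) on one canvas."""
    height = max(support.shape[0], query.shape[0])

    def pad(img):
        # Zero-pad at the bottom so both images share the same height.
        out = np.zeros((height,) + img.shape[1:], dtype=img.dtype)
        out[: img.shape[0]] = img
        return out

    canvas = np.concatenate([pad(support), pad(query)], axis=1)
    # Remember where the query sits: (top, left, height, width).
    query_box = (0, support.shape[1], query.shape[0], query.shape[1])
    return canvas, query_box

def crop_query_mask(canvas_mask, query_box):
    """Cut the query region out of a mask predicted on the whole canvas."""
    top, left, height, width = query_box
    return canvas_mask[top: top + height, left: left + width]
```

A mask predicted on the canvas would then be cropped back to the query region with `crop_query_mask` before evaluation.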

89. 【2604.05431】Cross-Stage Attention Propagation for Efficient Semantic Segmentation

Link: https://arxiv.org/abs/2604.05431

Authors: Beoungwoo Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent lightweight semantic, lightweight semantic segmentation, semantic segmentation methods, made significant progress, combining compact backbones

Comments: 7 pages, 6 figures

Abstract:Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely. This design preserves multi-scale contextual reasoning while substantially reducing the decoder's computational cost. CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating-point operations.

90. 【2604.05418】VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

Link: https://arxiv.org/abs/2604.05418

Authors: Honghao Fu, Miao Xu, Yiwei Wang, Dailing Zhang, Liu Jun, Yujun Cai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Scaling multimodal large, large language models, multimodal large language, limited context windows, Scaling multimodal

Comments: Accepted by ACL 2026

Abstract:Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
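
Multi-hop retrieval over a clip-level graph can be pictured as a bounded breadth-first expansion from the clips that match the query. The sketch below is only an illustration of that general idea. The adjacency-dict graph format and the name `multi_hop_retrieve` are assumptions, not the paper's data structures.

```python
from collections import deque

def multi_hop_retrieve(graph, seeds, max_hops):
    """Collect clips reachable from the seed clips within max_hops edges.

    graph: dict mapping a clip id to an iterable of neighbouring clip ids.
    Returns a dict mapping each retrieved clip id to its hop distance.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        clip = queue.popleft()
        if hops[clip] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbour in graph.get(clip, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[clip] + 1
                queue.append(neighbour)
    return hops
```

With `max_hops=1` this reduces to the seed clips plus their direct neighbours. Larger budgets pull in more distant but connected events.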

91. 【2604.05415】Learning to Synergize Semantic and Geometric Priors for Limited-Data Wheat Disease Segmentation

Link: https://arxiv.org/abs/2604.05415

Authors: Shijie Wang, Zijian Wang, Yadan Luo, Scott Chapman, Xin Yu, Zi Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces severe challenges, significant intra-class temporal, intra-class temporal variations, growth stages, Wheat disease segmentation

Abstract:Wheat disease segmentation is fundamental to precision agriculture but faces severe challenges from significant intra-class temporal variations across growth stages. Such substantial appearance shifts make collecting a representative dataset for training from scratch both labor-intensive and impractical. To address this, we propose SGPer, a Semantic-Geometric Prior Synergization framework that treats wheat disease segmentation under limited data as a coupled task of disease-specific semantic perception and disease boundary localization. Our core insight is that pretrained DINOv2 provides robust category-aware semantic priors to handle appearance shifts, which can be converted into coarse spatial prompts to guide SAM for the precise localization of disease boundaries. Specifically, SGPer designs disease-sensitive adapters with multiple disease-friendly filters and inserts them into both DINOv2 and SAM to align their pretrained representations with disease-specific characteristics. To operationalize this synergy, SGPer transforms DINOv2-derived features into dense, category-specific point prompts to ensure comprehensive spatial coverage of all disease regions. To subsequently eliminate prompt redundancy and ensure highly accurate mask generation, it dynamically filters these dense candidates by cross-referencing SAM's iterative mask confidence with the category-specific semantic consistency derived from DINOv2. Ultimately, SGPer distills a highly informative set of prompts to activate SAM's geometric priors, achieving precise and robust segmentation that remains strictly invariant to temporal appearance changes. Extensive evaluations demonstrate that SGPer consistently achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in data-constrained scenarios.

92. 【2604.05414】Training Without Orthogonalization, Inference With SVD: A Gradient Analysis of Rotation Representations

Link: https://arxiv.org/abs/2604.05414

Authors: Chris Choy

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: empirical evidence favoring, improves rotation estimation, Recent work, inference improves rotation, SVD

Abstract:Recent work has shown that removing orthogonalization during training and applying it only at inference improves rotation estimation in deep learning, with empirical evidence favoring 9D representations with SVD projection. However, the theoretical understanding of why SVD orthogonalization specifically harms training, and why it should be preferred over Gram-Schmidt at inference, remains incomplete. We provide a detailed gradient analysis of SVD orthogonalization specialized to $3 \times 3$ matrices and $SO(3)$ projection. Our central result derives the exact spectrum of the SVD backward pass Jacobian: it has rank $3$ (matching the dimension of $SO(3)$) with nonzero singular values $2/(s_i + s_j)$ and condition number $\kappa = (s_1 + s_2)/(s_2 + s_3)$, creating quantifiable gradient distortion that is most severe when the predicted matrix is far from $SO(3)$ (e.g., early in training when $s_3 \approx 0$). We further show that even stabilized SVD gradients introduce gradient direction error, whereas removing SVD from the training loop avoids this tradeoff entirely. We also prove that the 6D Gram-Schmidt Jacobian has an asymmetric spectrum: its parameters receive unequal gradient signal, explaining why 9D parameterization is preferable. Together, these results provide the theoretical foundation for training with direct 9D regression and applying SVD projection only at inference.
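
For readers who want to see the quantities in this abstract concretely, here is a minimal NumPy sketch. It projects a $3 \times 3$ matrix onto $SO(3)$ with SVD, then evaluates the quoted spectrum $2/(s_i + s_j)$ and condition number $\kappa = (s_1 + s_2)/(s_2 + s_3)$. The formulas are copied from the abstract rather than re-derived here, and singular values are assumed to be sorted in descending order, which is what `np.linalg.svd` returns. The function names are illustrative.

```python
import numpy as np

def project_to_so3(m):
    """Nearest rotation to a 3x3 matrix via SVD, with a determinant fix."""
    u, s, vt = np.linalg.svd(m)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    r = u @ np.diag([1.0, 1.0, d]) @ vt
    return r, s

def jacobian_spectrum(s):
    """Nonzero singular values 2 / (s_i + s_j), as quoted in the abstract."""
    s1, s2, s3 = s
    return [2.0 / (s1 + s2), 2.0 / (s1 + s3), 2.0 / (s2 + s3)]

def condition_number(s):
    """kappa = (s1 + s2) / (s2 + s3), as quoted in the abstract."""
    s1, s2, s3 = s
    return (s1 + s2) / (s2 + s3)

m = np.diag([3.0, 2.0, 1.0])
r, s = project_to_so3(m)
spectrum = jacobian_spectrum(s)
kappa = condition_number(s)
```

The ratio of the largest to the smallest entry of `spectrum` equals `kappa`, so the two quoted formulas are mutually consistent. For a matrix already on $SO(3)$, with $s = (1, 1, 1)$, the condition number is 1, and it grows as $s_3$ shrinks toward zero.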

93. 【2604.05409】CRISP: Rank-Guided Iterative Squeezing for Robust Medical Image Segmentation under Domain Shift

Link: https://arxiv.org/abs/2604.05409

Authors: Yizhou Fang, Pujin Cheng, Yixiang Liu, Xiaoying Tang, Longxi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical imaging remains, medical imaging, central bottleneck, clinical translation, Distribution shift

Abstract:Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we introduce an empirical law called ``Rank Stability of Positive Regions'', which states that the relative rank of predicted probabilities for positive voxels remains stable under distribution shift. Guided by this principle, we propose CRISP, a parameter-free and model-agnostic framework requiring no target-domain information. CRISP is the first framework to make segmentation based on rank rather than probabilities. CRISP simulates model behavior under distribution shift via latent feature perturbation, where voxel probability rankings exhibit two stable patterns: regions that consistently retain high probabilities (destined positives according to the principle) and those that remain low-probability (can be safely classified as negatives). Based on these patterns, we construct high-precision (HP) and high-recall (HR) priors and recursively refine them under perturbation. We then design an iterative training framework, making HP and HR progressively ``squeeze'' to the final segmentation. Extensive evaluations on multi-center cardiac MRI and CT-based lung vessel segmentation demonstrate CRISP's superior robustness, significantly outperforming state-of-the-art methods with striking HD95 reductions of up to 0.14 (7.0\% improvement), 1.90 (13.1\% improvement), and 8.39 (38.9\% improvement) pixels across multi-center, demographic, and modality shifts, respectively.

94. 【2604.05405】Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

Link: https://arxiv.org/abs/2604.05405

Authors: Hongsheng Li, Lingfeng Zhang, Zexian Yang, Liang Li, Rong Yin, Xiaoshuai Hao, Wenbo Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly challenging due, object detection, detection in adverse, adverse weather, challenging due

Abstract:Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dynamically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code will be released.
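
The soft aggregation step can be sketched in a few lines: a router emits one logit per branch, a softmax turns the logits into sample-specific weights, and the branch features are averaged with those weights. This is a generic illustration under the assumption of flat feature vectors. How the paper's router produces its logits from the condition token is not specified in the abstract, and `route` is a hypothetical name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(branch_features, router_logits):
    """Softly aggregate per-branch feature vectors with router weights.

    branch_features: list of equal-length feature vectors, one per branch
    (for example LiDAR, 4D radar, and fusion).
    router_logits: one score per branch for this sample.
    """
    weights = softmax(router_logits)
    dim = len(branch_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, branch_features))
        for i in range(dim)
    ]
    return fused, weights
```

Equal logits give a plain average of the branches, while one dominant logit makes the output follow that branch almost exclusively. That dominance is the behaviour the interpretability claim relies on.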

95. 【2604.05402】LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios

Link: https://arxiv.org/abs/2604.05402

Authors: Xiang Zhang, Tengfei Wang, Fang Xu, Xin Wang, Zongqian Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: remains challenging due, large-scale UAV scenarios, autonomous systems, environmental variations, Gaussian Splatting

Comments: This paper is under review at RA-L. The copyright might be transferred upon acceptance

Abstract:Visual localization in large-scale UAV scenarios is a critical capability for autonomous systems, yet it remains challenging due to geometric complexity and environmental variations. While 3D Gaussian Splatting (3DGS) has emerged as a promising scene representation, existing 3DGS-based visual localization methods struggle with robust pose initialization and sensitivity to rendering artifacts in large-scale settings. To address these limitations, we propose LSGS-Loc, a novel visual localization pipeline tailored for large-scale 3DGS scenes. Specifically, we introduce a scale-aware pose initialization strategy that combines scene-agnostic relative pose estimation with explicit 3DGS scale constraints, enabling geometrically grounded localization without scene-specific training. Furthermore, in the pose refinement, to mitigate the impact of reconstruction artifacts such as blur and floaters, we develop a Laplacian-based reliability masking mechanism that guides photometric refinement toward high-quality regions. Extensive experiments on large-scale UAV benchmarks demonstrate that our method achieves state-of-the-art accuracy and robustness for unordered image queries, significantly outperforming existing 3DGS-based approaches. Code is available at: this https URL
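
One plausible reading of "Laplacian-based reliability masking" is sketched below: pixels of the rendered image with a weak Laplacian response are treated as blurry and excluded from the photometric error. The abstract does not give the actual rule, so the thresholding direction, the 4-neighbour kernel, and all function names here are assumptions.

```python
import numpy as np

def laplacian(img):
    """4-neighbour discrete Laplacian with edge replication."""
    p = np.pad(img.astype(float), 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def reliability_mask(rendered, threshold):
    """Mark pixels whose Laplacian magnitude reaches the threshold as reliable."""
    return np.abs(laplacian(rendered)) >= threshold

def masked_photometric_error(rendered, query, mask):
    """Mean absolute difference over reliable pixels (0.0 if none are reliable)."""
    if not mask.any():
        return 0.0
    return float(np.abs(rendered - query)[mask].mean())
```

A pose refinement loop would minimize `masked_photometric_error` rather than the error over the full image, so that blurred regions contribute nothing to the objective.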

96. 【2604.05393】Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Link: https://arxiv.org/abs/2604.05393

Authors: Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Jun Gao, Weiming Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: demonstrated significant potential, Composed Image Retrieval, enabling flexible multimodal, flexible multimodal queries, modification text

Comments: Accepted to CVPR 2026. Project page, dataset, and code are available at: https://hahajun1101.github.io/OACIR/

Abstract:Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.

97. 【2604.05388】LUMOS: Universal Semi-Supervised OCT Retinal Layer Segmentation with Hierarchical Reliable Mutual Learning

Link: https://arxiv.org/abs/2604.05388

Authors: Yizhou Fang, Jian Zhong, Li Lin, Xiaoying Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optical Coherence Tomography, Optical Coherence, Coherence Tomography, faces challenges due, heterogeneous label granularities

Comments: 5 pages, 2 figures. Accepted to IEEE ISBI 2026. © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

Abstract:Optical Coherence Tomography (OCT) layer segmentation faces challenges due to annotation scarcity and heterogeneous label granularities across datasets. While semi-supervised learning helps alleviate label scarcity, existing methods typically assume a fixed granularity, failing to fully exploit cross-granularity supervision. This paper presents LUMOS, a semi-supervised universal OCT retinal layer segmentation framework based on a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML). DDN-HPS combines a dual-branch architecture with a multi-granularity prompting strategy to effectively suppress pseudo-label noise propagation. Meanwhile, RPML introduces region-level reliability weighting and a progressive training approach that guides the model from easier to more difficult tasks, ensuring the reliable selection of cross-granularity consistency targets, thereby achieving stable cross-granularity alignment. Experiments on six OCT datasets demonstrate that LUMOS largely outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.

98. 【2604.05378】ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

Link: https://arxiv.org/abs/2604.05378

Authors: Kaiser Hamid, Can Cui, Nade Liang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: evaluations largely assume, Recent progress, execute natural-language navigation, natural-language navigation commands, largely assume instructions

Abstract:Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

99. 【2604.05377】UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation

Link: https://arxiv.org/abs/2604.05377

Authors: Jintao Sun, Hu Zhang, Donglin Di, Gangyi Ding, Zhedong Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned Aerial Vehicles, high-altitude Unmanned Aerial, Aerial Vehicles, Unmanned Aerial, demonstrated remarkable capability

Comments: 20 pages, 12 figures, 7 tables

Abstract:Vision-Language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often fracture when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny and densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multi-modal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks like object detection or segmentation, UAVReason uniquely consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K 2-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. These results indicate limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.

100. 【2604.05366】3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

Link: https://arxiv.org/abs/2604.05366

Authors: Jae Joong Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Gaussian Splatting, reconstructors requires learning, method for compressing, reconstructors requires, per-scene fine-tuning

Comments: Preprint

Abstract:Every existing method for compressing 3D Gaussian Splatting, NeRF, or transformer-based 3D reconstructors requires learning a data-dependent codebook through per-scene fine-tuning. We show this is unnecessary. The parameter vectors that dominate storage in these models, 45-dimensional spherical harmonics in 3DGS and 1024-dimensional key-value vectors in DUSt3R, fall in a dimension range where a single random rotation transforms any input into coordinates with a known Beta distribution. This makes precomputed, data-independent Lloyd-Max quantization near-optimal, within a factor of 2.7 of the information-theoretic lower bound. We develop 3DTurboQuant, deriving (1) a dimension-dependent criterion that predicts which parameters can be quantized and at what bit-width before running any experiment, (2) norm-separation bounds connecting quantization MSE to rendering PSNR per scene, (3) an entry-grouping strategy extending rotation-based quantization to 2-dimensional hash grid features, and (4) a composable pruning-quantization pipeline with a closed-form compression ratio. On NeRF Synthetic, 3DTurboQuant compresses 3DGS by 3.5x with 0.02dB PSNR loss and DUSt3R KV caches by 7.9x with 39.7dB pointmap fidelity. No training, no codebook learning, no calibration data. Compression takes seconds. The code will be released (this https URL)
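
The core recipe described here (rotate randomly, then quantize each coordinate with a fixed, data-independent codebook) can be sketched roughly as follows. This is an illustration of the general technique, not the released implementation. Storing the norm separately, fitting the codebook with plain Lloyd iterations on coordinates of random unit vectors, and every function name below are assumptions.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix keeps the distribution uniform

def lloyd_max_codebook(samples, bits, iters=25):
    """Scalar Lloyd iterations on 1-D samples, started from quantiles."""
    levels = np.quantile(samples, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return np.sort(levels)

def quantize(x, rotation, levels):
    """Store the norm, rotate the unit direction, snap coordinates to levels."""
    norm = np.linalg.norm(x)  # assumes x is not the zero vector
    y = rotation @ (x / norm)
    return norm, np.abs(y[:, None] - levels[None, :]).argmin(axis=1)

def dequantize(norm, idx, rotation, levels):
    return norm * (rotation.T @ levels[idx])

rng = np.random.default_rng(0)
dim = 45
rotation = random_rotation(dim, rng)
# The codebook is fit on coordinates of random unit vectors, never on real data.
unit = rng.standard_normal((4000, dim))
unit /= np.linalg.norm(unit, axis=1, keepdims=True)
levels = lloyd_max_codebook(unit[:, 0], bits=4)

x = np.linspace(-1.0, 1.0, dim) ** 3
norm, idx = quantize(x, rotation, levels)
x_hat = dequantize(norm, idx, rotation, levels)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Because the codebook depends only on the dimension and the bit-width, it can be computed once and reused for every vector, which is what makes the approach calibration-free.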

101. 【2604.05363】Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection

Link: https://arxiv.org/abs/2604.05363

Authors: Rixiang Ni, Boyang Li, Jun Chen, Yonghao Li, Feiyu Ren, Yuji Wang, Haoyang Yuan, Wujiao He, Wei An

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to separate, separate small targets, separate small, IRSTD, clutter backgrounds

Abstract:Infrared small target detection (IRSTD) aims to separate small targets from clutter backgrounds. Extensive research is dedicated to the pixel-level supervision-guided "encoder-decoder" segmentation paradigm. Although these methods achieve promising performance, they neglect the fact that small targets occupy only a few pixels and usually have blurred boundaries caused by clutter backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization rather than separating the entire target region together with indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into a probabilistic response map consistent with infrared point-target response characteristics, with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4, demonstrate that SPIRE achieves competitive target-level detection performance with a consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: this https URL.
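
Turning a single-point annotation into a dense response map is commonly done with a Gaussian bump centred on the point, and the sketch below uses that choice. Whether PRPS uses a Gaussian is not stated in the abstract, so the kernel shape, the `sigma` value, and the function names are all assumptions.

```python
import numpy as np

def point_response_map(height, width, point, sigma=1.5):
    """Dense map peaking at 1.0 on the annotated (row, col) point."""
    rows, cols = np.mgrid[0:height, 0:width]
    r0, c0 = point
    dist_sq = (rows - r0) ** 2 + (cols - c0) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

def decode_centroid(response):
    """Recover the target location as the arg-max of a response map."""
    r, c = np.unravel_index(np.argmax(response), response.shape)
    return int(r), int(c)
```

Compared with a one-hot target, the smooth map gives the network a gradient signal in a neighbourhood of the point. This is one way to read the abstract's claim about increased supervision density.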

102. 【2604.05359】GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

Link: https://arxiv.org/abs/2604.05359

Authors: Yang Yi, Xieyuanli Chen, Jinpu Zhang, Hui Shen, Dewen Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision, descriptor discriminability, description are foundational, foundational tasks, tasks in computer

Abstract:Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: this https URL.

103. 【2604.05354】Unsupervised Multi-agent and Single-agent Perception from Cooperative Views

Link: https://arxiv.org/abs/2604.05354

Authors: Haochen Yang, Baolu Li, Lei Li, Delin Ren, Jiacheng Guo, Minghai Qin, Tianyun Zhang, Hongkai Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single-agent perception, shown promising performance, automated vehicles, single-agent, shown promising

Comments: Accepted to CVPR 2026

Abstract:LiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, no existing method simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) improved point cloud density after data sharing from cooperative views can benefit unsupervised object classification; 2) the cooperative view of multiple agents can be used as unsupervised guidance for 3D object detection in the single view. Based on these two insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter, which better classifies the candidate proposals after multi-agent point cloud density cooperation, with a Progressive Proposal Stabilizing module that yields reliable pseudo labels through easy-to-hard curriculum learning. Furthermore, we design Cross-View Consensus Learning to use the multi-agent cooperative view to guide detection in the single-agent view. Experimental results on two public datasets, V2V4Real and OPV2V, show that our UMS method achieves significantly higher 3D detection performance than state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised setting.

104. 【2604.05351】AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

Link: https://arxiv.org/abs/2604.05351

Authors: Yijie Deng, Shuaihang Yuan, Yi Fang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: require precise positioning, coarse success criterion, Image Goal Navigation, precise positioning, sufficient for finding

Abstract:Image Goal Navigation (ImageNav) is evaluated by a coarse success criterion (the agent must stop within 1m of the target), which is sufficient for finding objects but falls short for downstream tasks such as grasping that require precise positioning. We introduce AnyImageNav, a training-free system that pushes ImageNav toward this more demanding setting. Our key insight is that the goal image can be treated as a geometric query: any photo of an object, a hallway, or a room corner can be registered to the agent's observations via dense pixel-level correspondences, enabling recovery of the exact 6-DoF camera pose. Our method realizes this through a semantic-to-geometric cascade: a semantic relevance signal guides exploration and acts as a proximity gate, invoking a 3D multi-view foundation model only when the current view is highly relevant to the goal image; the model then self-certifies its registration in a loop for an accurate recovered pose. Our method sets state-of-the-art navigation success rates on Gibson (93.1%) and HM3D (82.6%), and achieves pose recovery that prior methods do not provide: a position error of 0.27m and heading error of 3.41 degrees on Gibson, and 0.21m / 1.23 degrees on HM3D, a 5-10x improvement over adapted baselines.
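
The two pose metrics quoted above can be computed as in the sketch below. The abstract does not define them, so this assumes the usual conventions: Euclidean distance between positions in metres, and the smallest absolute yaw difference in degrees. The function names are illustrative.

```python
import math

def position_error(estimated_xyz, goal_xyz):
    """Euclidean distance between two positions, in the input units."""
    return math.dist(estimated_xyz, goal_xyz)

def heading_error_deg(yaw_a_deg, yaw_b_deg):
    """Smallest absolute angle between two headings, in degrees."""
    diff = (yaw_a_deg - yaw_b_deg + 180.0) % 360.0 - 180.0
    return abs(diff)
```

The modulo wrap matters near the 0/360 boundary: headings of 359 and 2 degrees differ by 3 degrees, not 357.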

105. 【2604.05323】VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success

Link: https://arxiv.org/abs/2604.05323

Authors: Chuhang Liu, Yayun He, Zuheng Kang, Xiaoyang Qu, Jianzong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: exhibiting broad application, broad application potential, cross-modal semantic alignment, language understanding, exhibiting broad

Comments: Accepted to the 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)

Abstract:Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action decision-making for cross-modal semantic alignment, exhibiting broad application potential. However, the joint processing of high-dimensional visual features, complex linguistic inputs, and continuous action sequences incurs significant computational overhead and low inference efficiency, thereby hindering real-time deployment and reliability. To address this issue, we use image entropy to quantify the grayscale distribution characteristics of each visual token and introduce attention entropy to capture the distribution of attention scores over task-related text. Visual entropy identifies texture-rich or structurally informative regions, while attention entropy pinpoints semantically relevant tokens. Combined with timestep information, these metrics enable a dynamic transition strategy that shifts the model's focus from global visual features to attention-guided local informative regions. Thus, the resulting VLA-InfoEntropy method integrates spatial, semantic, and temporal cues to reduce redundancy while preserving critical content. Extensive experiments show that our method reduces inference parameters, accelerates inference speed, and outperforms existing approaches.
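
As a rough illustration of the two quantities this abstract names, the sketch below computes the Shannon entropy of a patch's grayscale histogram and of a softmax-normalised attention vector. The function names, the toy patches, and the choice of softmax normalisation are assumptions made for illustration; the abstract does not specify how the paper combines these scores.

```python
import math
from collections import Counter


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def patch_entropy(gray_patch):
    """Entropy of the grayscale histogram of one patch (rows of int pixels)."""
    pixels = [v for row in gray_patch for v in row]
    counts = Counter(pixels)
    return shannon_entropy(c / len(pixels) for c in counts.values())


def attention_entropy(scores):
    """Entropy of attention scores after a numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return shannon_entropy(e / z for e in exps)


# Hypothetical toy inputs: a flat patch and a checkerboard patch.
flat_patch = [[128] * 4 for _ in range(4)]
textured_patch = [[0, 255, 0, 255], [255, 0, 255, 0]] * 2

flat_h = patch_entropy(flat_patch)          # no texture, minimal entropy
textured_h = patch_entropy(textured_patch)  # two equally likely gray levels
uniform_attn_h = attention_entropy([1.0, 1.0, 1.0, 1.0])
peaked_attn_h = attention_entropy([10.0, 0.0, 0.0, 0.0])
```

Under this reading, a texture-rich patch scores higher visual entropy than a flat one, while a sharply peaked attention distribution scores lower attention entropy than a uniform one.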

106. 【2604.05316】Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting

Link: https://arxiv.org/abs/2604.05316

Authors: Monica Tang, Avideh Zakhor

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, indoor asset detection, target indoor assets, drone-captured imagery, present an approach

Comments: Accepted to CVPR 2026 3DMV Workshop

Abstract:We present an approach for object-level detection and segmentation of target indoor assets in 3D Gaussian Splatting (3DGS) scenes, reconstructed from 360° drone-captured imagery. We introduce a 3D object codebook that jointly leverages mask semantics and spatial information of their corresponding Gaussian primitives to guide multi-view mask association and indoor asset detection. By integrating 2D object detection and segmentation models with semantically and spatially constrained merging procedures, our method aggregates masks from multiple views into coherent 3D object instances. Experiments on two large indoor scenes demonstrate reliable multi-view mask consistency, improving F1 score by 65% over state-of-the-art baselines, and accurate object-level 3D indoor asset detection, achieving an 11% mAP gain over baseline methods.

107. 【2604.05301】SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration

Link: https://arxiv.org/abs/2604.05301

Authors: Xueming Fu, Lixia Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attenuates scene radiance, simultaneously attenuates scene, smoke simultaneously attenuates, adds airlight, making robust

Comments: Lab Report for NTIRE 2026 3DRR Track 2

Abstract:Real-world smoke simultaneously attenuates scene radiance, adds airlight, and destabilizes multi-view appearance consistency, making robust 3D reconstruction particularly difficult. We present SmokeGS-R, a practical pipeline developed for the NTIRE 2026 3D Restoration and Reconstruction Track 2 challenge. The key idea is to decouple geometry recovery from appearance correction: we generate physics-guided pseudo-clean supervision with a refined dark channel prior and guided filtering, train a sharp clean-only 3D Gaussian Splatting source model, and then harmonize its renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing. On the official challenge testing leaderboard, the final submission achieved PSNR = 15.217 and SSIM = 0.666. After the public release of RealX3D, we re-evaluated the same frozen result on the seven released challenge scenes without retraining and obtained PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, outperforming the strongest official baseline average on the same scenes by +3.68 dB PSNR. These results suggest that a geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective recipe for real-world multi-view smoke restoration. The code is available at this https URL.
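
For readers unfamiliar with the dark channel prior mentioned in this abstract, the sketch below shows the classical dark channel computation: the minimum over colour channels, followed by a minimum over a local window. It is not the refined variant the authors use, whose details the abstract does not give; the function name, the `radius` parameter, and the toy image are assumptions for illustration.

```python
def dark_channel(image, radius=1):
    """Classical dark channel of an RGB image given as rows of (r, g, b) tuples.

    Each output value is the minimum intensity over all channels of all
    pixels inside a (2 * radius + 1) square window, clipped at the borders.
    """
    height, width = len(image), len(image[0])
    # Step 1: per-pixel minimum over the colour channels.
    channel_min = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum filter over the local window.
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            dark[y][x] = min(channel_min[j][i] for j in ys for i in xs)
    return dark


# Hypothetical 2x2 image with intensities in [0, 1].
toy_image = [
    [(0.9, 0.8, 0.7), (0.5, 0.6, 0.4)],
    [(0.3, 0.9, 0.9), (1.0, 1.0, 1.0)],
]
per_pixel = dark_channel(toy_image, radius=0)
windowed = dark_channel(toy_image, radius=1)
```

In haze-free regions the dark channel tends toward zero, so large values hint at smoke or airlight, which is why it is commonly used to estimate transmission before restoration.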

108. 【2604.05296】From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

Link: https://arxiv.org/abs/2604.05296

Authors: Daniel George, Charles Yeh, Daniel Lee, Yifei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Frozen visual embeddings, unmeasured identity leakage, Frozen visual, integrity systems, deployable mitigations

Comments: 20 pages, 4 figures

Abstract:Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face-context attribution with equal-area perturbations; and (ii) a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection (ISP). Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes, with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD, are robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.
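
Linear subspace removal, as described in this abstract, can be read as applying the projector P = I - U Uᵀ, where the columns of U are an orthonormal basis of the subspace to remove. The sketch below shows that generic operation on a toy 3-D embedding. How the identity subspace is estimated is not stated in the abstract, so the direction vectors here are hypothetical, as are all function names.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis spanning `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(x, basis):
    """Apply P = I - U U^T to x, where `basis` holds orthonormal vectors."""
    out = list(x)
    for b in basis:
        coeff = dot(x, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical "identity" directions and a toy embedding.
identity_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(identity_directions)
embedding = [3.0, 4.0, 5.0]
cleaned = remove_subspace(embedding, basis)
```

After projection the embedding has no component along the removed directions, so a linear probe restricted to that subspace sees nothing, while the complementary coordinates are left unchanged. Applying the projector twice gives the same result as applying it once.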

109. 【2604.05272】Final Report, Center for Computer-Integrated Surgical Systems and Technology, NSF ERC Cooperative Agreement EEC9731748, Volume 1

Link: https://arxiv.org/abs/2604.05272

Authors: Russell H. Taylor, Gregory D. Hager, Ralph Etienne-Cummings, Eric Grimson, Ron Kikinis, Cameron Riviere

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Engineering Research Center, National Science Foundation, ten years, CISST ERC, Science Foundation funding

Abstract:In the last ten years, medical robotics has moved from the margins to the mainstream. Since the Engineering Research Center for Computer-Integrated Surgical Systems and Technology was launched in 1998 with National Science Foundation funding, medical robots have been promoted from handling routine tasks to performing highly sophisticated interventions and related assignments. The CISST ERC has played a significant role in this transformation. And thanks to NSF support, the ERC has built the professional infrastructure that will continue our mission: bringing data and technology together in clinical systems that will dramatically change how surgery and other procedures are done. The enhancements we envision touch virtually every aspect of the delivery of care: more accurate procedures; more consistent, predictable results from one patient to the next; improved clinical outcomes; greater patient safety; reduced liability for healthcare providers; lower costs for everyone (patients, facilities, insurers, government); easier, faster recovery for patients; effective new ways to treat health problems; and healthier patients and a healthier system. The basic science and engineering the ERC is developing now will yield profound benefits for all concerned about health care, from government agencies to insurers, from clinicians to patients to the general public. All will experience the healing touch of medical robotics, thanks in no small part to the work of the CISST ERC and its successors.

110. 【2604.05271】Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition

Link: https://arxiv.org/abs/2604.05271

Authors: Gabriel E. Lima, Valfride Nascimento, Eduardo Santos, Eduil Nascimento Jr, Rayson Laroca, David Menotti

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Extracting vehicle information, intelligent transportation systems, Extracting vehicle, criminal investigations, Automatic License Plate

Comments: Accepted for publication in the Journal of the Brazilian Computer Society (JBCS)

Abstract:Extracting vehicle information from surveillance images is essential for intelligent transportation systems, enabling applications such as traffic monitoring and criminal investigations. While Automatic License Plate Recognition (ALPR) is widely used, Fine-Grained Vehicle Classification (FGVC) offers a complementary approach by identifying vehicles based on attributes such as color, make, model, and type. Although there have been advances in this field, existing studies often assume well-controlled conditions, explore limited attributes, and overlook FGVC integration with ALPR. To address these gaps, we introduce UFPR-VeSV, a dataset comprising 24,945 images of 16,297 unique vehicles with annotations for 13 colors, 26 makes, 136 models, and 14 types. Collected from the Military Police of Paraná (Brazil) surveillance system, the dataset captures diverse real-world conditions, including partial occlusions, nighttime infrared imaging, and varying lighting. All FGVC annotations were validated using license plate information, with text and corner annotations also being provided. A qualitative and quantitative comparison with established datasets confirmed the challenging nature of our dataset. A benchmark using five deep learning models further validated this, revealing specific challenges such as handling multicolored vehicles, infrared images, and distinguishing between vehicle models that share a common platform. Additionally, we apply two optical character recognition models to license plate recognition and explore the joint use of FGVC and ALPR. The results highlight the potential of integrating these complementary tasks for real-world applications. The UFPR-VeSV dataset is publicly available at: this https URL.

111. 【2604.05268】Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

Link: https://arxiv.org/abs/2604.05268

Authors: Chan-Wei Hu, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multi-modal retrieval-augmented generation, Multi-modal retrieval-augmented, retrieval-augmented generation, relies heavily, image-question queries

Comments: 12 pages, 4 figures

Abstract:Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.

112. 【2604.05259】Coverage Optimization for Camera View Selection

Link: https://arxiv.org/abs/2604.05259

Authors: Timothy Chen, Adam Dai, Maximilian Adang, Grace Gao, Mac Schwager

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: makes a good, Fisher Information Gain, active view selection, view selection, good viewpoint

Abstract:What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We call this metric COVER (Camera Optimization for View Exploration and Reconstruction). We integrate our method into the Nerfstudio framework and evaluate it on real datasets within fixed and embodied data acquisition scenarios. Across multiple datasets and radiance-field baselines, our method consistently improves reconstruction quality compared to state-of-the-art active view selection methods. Additional visualizations and our Nerfstudio package can be found at this https URL.

113. 【2604.05256】Protecting and Preserving Protest Dynamics for Responsible Analysis

Link: https://arxiv.org/abs/2604.05256

Authors: Cohen Archbold, Usman Hassan, Nazmus Sakib, Sen-ching Cheung, Abdullah-Al-Zubaer Imran

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Protest-related social media, inherently high-risk due, Protest-related social, concerns surrounding surveillance, understanding collective action

Comments: 21 pages, 6 figures, Submitted to ACM Journal on Responsible Computing

Abstract:Protest-related social media data are valuable for understanding collective action but inherently high-risk due to concerns surrounding surveillance, repression, and individual privacy. Contemporary AI systems can identify individuals, infer sensitive attributes, and cross-reference visual information across platforms, enabling surveillance that poses risks to protesters and bystanders. In such contexts, large foundation models trained on protest imagery risk memorizing and disclosing sensitive information, leading to cross-platform identity leakage and retroactive participant identification. Existing approaches to automated protest analysis do not provide a holistic pipeline that integrates privacy risk assessment, downstream analysis, and fairness considerations. To address this gap, we propose a responsible computing framework for analyzing collective protest dynamics while reducing risks to individual privacy. Our framework replaces sensitive protest imagery with well-labeled synthetic reproductions using conditional image synthesis, enabling analysis of collective patterns without direct exposure of identifiable individuals. We demonstrate that our approach produces realistic and diverse synthetic imagery while balancing downstream analytical utility with reductions in privacy risk. We further assess demographic fairness in the generated data, examining whether synthetic representations disproportionately affect specific subgroups. Rather than offering absolute privacy guarantees, our method adopts a pragmatic, harm-mitigating approach that enables socially sensitive analysis while acknowledging residual risks.

114. 【2604.05227】Active Measurement of Two-Point Correlations

Link: https://arxiv.org/abs/2604.05227

Authors: Max Hamilton, Daniel Sheldon, Subhransu Maji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Two-point correlation functions, correlation functions, points, points cluster, cluster in space

Comments: AIStats 2026

Abstract:Two-point correlation functions (2PCF) are widely used to characterize how points cluster in space. In this work, we study the problem of measuring the 2PCF over a large set of points, restricted to a subset satisfying a property of interest. An example comes from astronomy, where scientists measure the 2PCF of star clusters, which make up only a tiny subset of possible sources within a galaxy. This task typically requires careful labeling of sources to construct catalogs, which is time-consuming. We present a human-in-the-loop framework for efficient estimation of 2PCF of target sources. By leveraging a pre-trained classifier to guide sampling, our approach adaptively selects the most informative points for human annotation. After each annotation, it produces unbiased estimates of pair counts across multiple distance bins simultaneously. Compared to simple Monte Carlo approaches, our method achieves substantially lower variance while significantly reducing annotation effort. We introduce a novel unbiased estimator, sampling strategy, and confidence interval construction that together enable scalable and statistically grounded measurement of two-point correlations in astronomy datasets.
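
The quantity this abstract estimates, pair counts across distance bins restricted to a subset of points, can be illustrated with an exhaustive computation on a fully labeled toy set. This is only the brute-force reference, not the paper's sampling-based estimator; the function name, the points, the labels, and the bin edges are all hypothetical.

```python
import math
from itertools import combinations


def pair_counts(points, bin_edges):
    """Count unordered point pairs whose distance falls in each bin.

    Bin i covers the half-open interval [bin_edges[i], bin_edges[i + 1]).
    Pairs outside every bin are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Hypothetical catalog: 2-D positions with a boolean "is target" label.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
is_target = [True, True, True, False]

targets = [p for p, keep in zip(positions, is_target) if keep]
counts = pair_counts(targets, bin_edges=[0.0, 1.2, 2.0])
```

The exhaustive version needs every label up front and costs O(n²) distance evaluations, which is the annotation and compute burden that motivates estimating these counts from a small, adaptively chosen set of labeled points.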

115. 【2604.05215】Hierarchical Mesh Transformers with Topology-Guided Pretraining for Morphometric Analysis of Brain Structures

Link: https://arxiv.org/abs/2604.05215

Authors: Yujian Xiong, Mohammad Farazi, Yanxi Chen, Wenhui Zhu, Xuanzhao Dong, Natasha Lepore, Yi Su, Raza Mushtaq, Stephen Foldes, Andrew Yang, Yalin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

Keywords: subtle disease-related signals, poses significant challenges, carry subtle disease-related, Representation learning, meshes poses significant

Abstract:Representation learning on large-scale unstructured volumetric and surface meshes poses significant challenges in neuroimaging, especially when models must incorporate diverse vertex-level morphometric descriptors, such as cortical thickness, curvature, sulcal depth, and myelin content, which carry subtle disease-related signals. Current approaches either ignore these clinically informative features or support only a single mesh topology, restricting their use across imaging pipelines. We introduce a hierarchical transformer framework designed for heterogeneous mesh analysis that operates on spatially adaptive tree partitions constructed from simplicial complexes of arbitrary order. This design accommodates both volumetric and surface discretizations within a single architecture, enabling efficient multi-scale attention without topology-specific modifications. A feature projection module maps variable-length per-vertex clinical descriptors into the spatial hierarchy, separating geometric structure from feature dimensionality and allowing seamless integration of different neuroimaging feature sets. Self-supervised pretraining via masked reconstruction of both coordinates and morphometric channels on large unlabeled cohorts yields a transferable encoder backbone applicable to diverse downstream tasks and mesh modalities. We validate our approach on Alzheimer's disease classification and amyloid burden prediction using volumetric brain meshes from ADNI, as well as focal cortical dysplasia detection on cortical surface meshes from the MELD dataset, achieving state-of-the-art results across all benchmarks.

116. 【2604.05212】Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

Link: https://arxiv.org/abs/2604.05212

Authors: Daniel DeTone, Tianwei Shen, Fan Zhang, Lingni Ma, Julian Straub, Richard Newcombe, Jakob Engel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision problem, fundamental computer vision, vision problem, fundamental computer, computer vision

Comments: project page: http://facebookresearch.github.io/boxer

Abstract:Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).

117. 【2604.05210】Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

Link: https://arxiv.org/abs/2604.05210

Authors: Muhammad Adil, Mehmood Ahmed, Muhammad Aqib, Vicente A. Gonzalez, Gaang Lee, Qipei Mei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preventing workplace accidents, Accurate and timely, workplace accidents, essential for preventing, preventing workplace

Abstract:Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.

118. 【2604.05183】OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models

Link: https://arxiv.org/abs/2604.05183

Authors: Ali Aliev, Kamil Garifullin, Nikolay Yudin, Vera Soboleva, Alexander Molozhavenko, Ivan Oseledets, Aibek Alanov, Maxim Rakhuba

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: rapidly growing field, constant practical interest, training data, rapidly growing, growing field

Abstract:In the rapidly growing field of model training, there is constant practical interest in parameter-efficient fine-tuning and in techniques that use a small amount of training data to adapt a model to a narrow task. However, an open question remains: how can several adapters tuned for different tasks be combined into one that yields adequate results on all of them? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that, in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to derive formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group-and-Shuffle ($\mathcal{GS}$) orthogonal matrices, and obtain efficient formulas for approximating the geodesics between two points. Additionally, we propose a spectra restoration transform that restores spectral properties of the merged adapter for higher-quality fusion. We conduct experiments on subject-driven generation tasks showing that our technique for merging two $\mathcal{GS}$ orthogonal matrices is capable of uniting the concept and style features of different adapters. To the best of our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters. Code is available at this https URL.

119. 【2604.05182】LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

Link: https://arxiv.org/abs/2604.05182

Authors: Zhengqin Li, Cheng Zhang, Jakob Engel, Zhao Dong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Sparse Reconstruction, scaling transformer context, introduce the Large, windows impacts feed-forward, transformer context windows

Abstract:We introduce the Large Sparse Reconstruction Model to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window -- by substantially increasing the number of active object and image tokens -- remarkably narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that establishes accurate 2D-3D correspondences using explicit geometric distances rather than standard attention scores; and (3) a custom block-aware sequence parallelism strategy utilizing an All-gather-KV protocol to balance dynamic, sparse workloads across GPUs. As a result, LSRM handles 20x more object tokens and 2x more image tokens than prior state-of-the-art (SOTA) methods. Extensive evaluations on standard novel-view synthesis benchmarks show substantial gains over the current SOTA, yielding 2.5 dB higher PSNR and 40% lower LPIPS. Furthermore, when extending LSRM to inverse rendering tasks, qualitative and quantitative evaluations on widely-used benchmarks demonstrate consistent improvements in texture and geometry details, achieving an LPIPS that matches or exceeds that of SOTA dense-view optimization methods. Code and model will be released on our project page.

120. 【2604.05180】MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

Link: https://arxiv.org/abs/2604.05180

Authors: Ziqian Liu, Stephan Alaniz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Instruction-guided image editing, requiring individual edits, Instruction-guided image, multiple similar instances, individual edits

Abstract:Instruction-guided image editing has seen remarkable progress with models like FLUX.2 and Qwen-Image-Edit, yet they still struggle with complex scenarios with multiple similar instances each requiring individual edits. We observe that state-of-the-art models suffer from severe over-editing and spatial misalignment when faced with multiple identical instances and composite instructions. To this end, we introduce a comprehensive benchmark specifically designed to evaluate fine-grained consistency in multi-instance and multi-instruction settings. To address the failures of existing methods observed in our benchmark, we propose Multi-Instance Regional Alignment via Guided Editing (MIRAGE), a training-free framework that enables precise, localized editing. By leveraging a vision-language model to parse complex instructions into regional subsets, MIRAGE employs a multi-branch parallel denoising strategy. This approach injects latent representations of target regions into the global representation space while maintaining background integrity through a reference trajectory. Extensive evaluations on MIRA-Bench and RefEdit-Bench demonstrate that our framework significantly outperforms existing methods in achieving precise instance-level modifications while preserving background consistency. Our benchmark and code are available at this https URL.

121. 【2604.05171】Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

Link: https://arxiv.org/abs/2604.05171

Authors: Mingjie Li, Edward Kim, Yue Zhao, Ehsan Adeli, Kilian M. Pohl

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: robust Variational Autoencoder, deep learning applications, Variational Autoencoder, robust Variational, deep learning

Comments: CVPR Findings track

Abstract:Learning a robust Variational Autoencoder (VAE) is a fundamental step for many deep learning applications in medical image analysis, such as MRI synthesis. Existing brain VAEs predominantly focus on single-modality data (i.e., T1-weighted MRI), overlooking the complementary diagnostic value of other modalities like T2-weighted MRIs. Here, we propose a modality-aware and anatomically grounded 3D vector-quantized VAE (VQ-VAE) for reconstructing multi-modal brain MRIs. Called NeuroQuant, it first learns a shared latent representation across modalities using factorized multi-axis attention, which can capture relationships between distant brain regions. It then employs a dual-stream 3D encoder that explicitly separates the encoding of modality-invariant anatomical structures from modality-dependent appearance. Next, the anatomical encoding is discretized using a shared codebook and combined with modality-specific appearance features via Feature-wise Linear Modulation (FiLM) during the decoding phase. The entire approach is trained using a joint 2D/3D strategy in order to account for the slice-based acquisition of 3D MRI data. Extensive experiments on two multi-modal brain MRI datasets demonstrate that NeuroQuant achieves superior reconstruction fidelity compared to existing VAEs, enabling a scalable foundation for downstream generative modeling and cross-modal brain image analysis.

122. 【2604.05147】Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging

Link: https://arxiv.org/abs/2604.05147

Authors: Md Rahatul Islam Udoy,Diego Ferrer,Wantong Li,Kai Ni,Sumeet Kumar Gupta,Ahmedullah Aziz

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: security in image, imaging pipeline, essential as visual, visual data, exposed through multiple

Abstract:Ensuring end-to-end security in image sensors has become essential as visual data can be exposed through multiple stages of the imaging pipeline. Advanced protection requires encryption to occur before pixel values appear on any readout lines. This work introduces a secure pixel sensor (SecurePix), a compact CMOS-compatible pixel architecture that performs true in-pixel encryption using a symmetric key realized through programmable, non-volatile multidomain polarization states of a ferroelectric field-effect transistor. The pixel and array operations are designed and simulated in HSPICE, while a 45 nm CMOS process design kit is used for layout drawing. The resulting layout confirms a pixel pitch of 2.33 x 3.01 um^2. Each pixel's non-volatile programming level defines its analog transfer characteristic, enabling the photodiode voltage to be converted into an encrypted analog output within the pixel. Full-image evaluation shows that ResNet-18 recognition accuracy drops from 99.29 percent to 9.58 percent on MNIST and from 91.33 percent to 6.98 percent on CIFAR-10 after encryption, indicating strong resistance to neural-network-based inference. Lookup-table-based inverse mapping enables recovery for authorized receivers using the same symmetric key. Based on HSPICE simulation, the SecurePix achieves a per-pixel programming power-delay product of 17 uW us and a per-pixel sensing power-delay product of 1.25 uW us, demonstrating low-overhead hardware-level protection.

123. 【2604.05117】Watch Before You Answer: Learning from Visually Grounded Post-Training

Link: https://arxiv.org/abs/2604.05117

Authors: Yuxuan Zhang,EunJeong Hwang,Huaisong Zhang,Penghui Du,Yiming Jia,Dongfu Jiang,Xuan He,Shenhui Zhang,Ping Nie,Peter West,Kelsey R. Allen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: comprehensively understand visual, vision-language models, video understanding, critical for vision-language, comprehensively understand

Abstract:It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: this http URL.
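
The filtering idea in this abstract can be sketched in a few lines: drop every question that a text-only answerer already gets right, and keep the rest for post-training. The abstract does not spell out VidGround's actual criterion, so the rule below, the sample format, and the `longest_option_baseline` stand-in for a text-only model are all assumptions made for illustration.

```python
def is_visually_grounded(sample, text_only_answer):
    """Treat a question as visually grounded if the text-only guess is wrong."""
    guess = text_only_answer(sample["question"], sample["options"])
    return guess != sample["answer"]


def filter_grounded(dataset, text_only_answer):
    """Keep only the samples that cannot be answered from text cues alone."""
    return [s for s in dataset if is_visually_grounded(s, text_only_answer)]


def longest_option_baseline(question, options):
    """Hypothetical stand-in for a text-only model: always pick the longest option."""
    return max(options, key=len)


dataset = [
    {"question": "What colour is the car?", "options": ["red", "blue"], "answer": "red"},
    {"question": "Which is a mammal?", "options": ["a dolphin", "a rock"], "answer": "a dolphin"},
]
grounded = filter_grounded(dataset, longest_option_baseline)
```

In practice the text-only answerer would be a language model queried without video frames, possibly over several runs to reduce lucky guesses.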

124. 【2604.05110】Simultaneous Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

Link: https://arxiv.org/abs/2604.05110

Authors: Jorge Alberto Garza-Abdala,Gerardo A. Fumagal-González,Eduardo de Avila-Armenta,Sadam Hussain,Jasiel H. Toscano-Martínezb,Diana S. M. Rosales Gurmendi,Alma A. Pedro-Pérez,Jose G. Tamez-Pena

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: provide complementary information, views provide complementary, cancer screening relies, screening relies heavily, heavily on mammography

Comments: Accepted and presented at SPIE Medical Imaging 2025 (Vancouver, Canada)

Abstract:Breast cancer screening relies heavily on mammography, where the craniocaudal (CC) and mediolateral oblique (MLO) views provide complementary information for diagnosis. However, many datasets lack complete paired views, limiting the development of algorithms that depend on cross-view consistency. To address this gap, we propose a three-channel denoising diffusion probabilistic model capable of simultaneously generating CC and MLO views of a single breast. In this configuration, the two mammographic views are stored in separate channels, while a third channel encodes their absolute difference to guide the model toward learning coherent anatomical relationships between projections. A pretrained DDPM from Hugging Face was fine-tuned on a private screening dataset and used to synthesize dual-view pairs. Evaluation included geometric consistency via automated breast mask segmentation and distributional comparison with real images, along with qualitative inspection of cross-view alignment. The results show that the difference-based encoding helps preserve the global breast structure across views, producing synthetic CC-MLO pairs that resemble real acquisitions. This work demonstrates the feasibility of simultaneous dual-view mammogram synthesis using a difference-guided DDPM, highlighting its potential for dataset augmentation and future cross-view-aware AI applications in breast imaging.
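
The three-channel layout described in this abstract (CC view, MLO view, and their absolute difference) can be sketched as follows. Images are represented as nested lists of integer intensities to keep the example dependency-free; the function names and toy pixel values are hypothetical, and any resizing or normalisation the authors apply is not shown.

```python
def stack_dual_view(cc, mlo):
    """Build a 3-channel sample [CC, MLO, |CC - MLO|] from two same-sized 2D images."""
    if len(cc) != len(mlo) or any(len(r1) != len(r2) for r1, r2 in zip(cc, mlo)):
        raise ValueError("CC and MLO images must have the same shape")
    diff = [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(cc, mlo)]
    return [cc, mlo, diff]


def split_dual_view(sample):
    """Recover the two views from a 3-channel sample; the difference channel is dropped."""
    cc, mlo, _ = sample
    return cc, mlo


# Toy 2x2 "images" with made-up intensities.
cc = [[50, 200], [120, 30]]
mlo = [[80, 200], [0, 90]]
sample = stack_dual_view(cc, mlo)
```

After generation, only the first two channels are needed as the synthetic CC-MLO pair; the third channel serves as a training-time cue.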

125. 【2604.05079】SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Link: https://arxiv.org/abs/2604.05079

Authors: Zhongyu Yang,Zuhao Yang,Shuo Zhan,Tan Yue,Wei Pang,Yingfang Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires integrating spatial, Video question answering, integrating spatial, challenging task, task that requires

Comments: Published in CVPR2026

Abstract:Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

126. 【2604.05070】Part-Level 3D Gaussian Vehicle Generation with Joint and Hinge Axis Estimation

Link: https://arxiv.org/abs/2604.05070

Authors: Shiyao Qian,Yuan Ren,Dongfeng Bai,Bingbing Liu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: capture part-level articulation, autonomous driving, essential for autonomous, fail to capture, capture part-level

Comments: submitted to IROS 2026

Abstract:Simulation is essential for autonomous driving, yet current frameworks often model vehicles as rigid assets and fail to capture part-level articulation. With perception algorithms increasingly leveraging dynamics such as wheel steering or door opening, realistic simulation requires animatable vehicle representations. Existing CAD-based pipelines are limited by library coverage and fixed templates, preventing faithful reconstruction of in-the-wild instances. We propose a generative framework that, from a single image or sparse multi-view input, synthesizes an animatable 3D Gaussian vehicle. Our method addresses two challenges: (i) large 3D asset generators are optimized for static quality but not articulation, leading to distortions at part boundaries when animated; and (ii) segmentation alone cannot provide the kinematic parameters required for motion. To overcome this, we introduce a part-edge refinement module that enforces exclusive Gaussian ownership and a kinematic reasoning head that predicts joint positions and hinge axes of movable parts. Together, these components enable faithful part-aware simulation, bridging the gap between static generation and animatable vehicle models.
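
To make concrete why a joint position and a hinge axis are enough to animate a part, here is a minimal sketch that rotates a single 3D point (for example, a Gaussian centre belonging to a door) about a hinge line using Rodrigues' rotation formula. The door geometry and the function name are hypothetical, and a full animation would also have to rotate each Gaussian's orientation and covariance, which is omitted here.

```python
import math


def rotate_about_hinge(point, pivot, axis, angle_rad):
    """Rotate `point` by `angle_rad` around the line through `pivot` with direction `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                    # unit hinge direction
    v = [p - q for p, q in zip(point, pivot)]       # point relative to the joint
    cos_t, sin_t = math.cos(angle_rad), math.sin(angle_rad)
    k_dot_v = sum(a * b for a, b in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    # Rodrigues' formula: v*cos + (k x v)*sin + k*(k.v)*(1 - cos)
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return [r + q for r, q in zip(rotated, pivot)]


# Hypothetical door: vertical hinge (z direction) passing through (1, 0, 0).
door_centre = [2.0, 0.0, 0.5]
opened = rotate_about_hinge(
    door_centre, pivot=[1.0, 0.0, 0.0], axis=[0.0, 0.0, 2.0], angle_rad=math.pi / 2
)
```

Opening the door by 90 degrees swings the point from the x direction to the y direction around the hinge while its height stays unchanged.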

127. 【2604.05060】R3PM-Net: Real-time, Robust, Real-world Point Matching Network

Link: https://arxiv.org/abs/2604.05060

Authors: Yasaman Kashefbahrami,Erkut Akdag,Panagiotis Meletis,Evgeniya Balmashnova,Dip Goswami,Egor Bondarau

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurate Point Cloud, Point Cloud Registration, Cloud Registration, Accurate Point, Point Cloud

Comments: Accepted to CVPRw 2026 (Oral), Code and datasets at [this https URL](https://github.com/YasiiKB/R3PM-Net)

Abstract:Accurate Point Cloud Registration (PCR) is an important task in 3D data processing, involving the estimation of a rigid transformation between two point clouds. While deep-learning methods have addressed key limitations of traditional non-learning approaches, such as sensitivity to noise, outliers, occlusion, and initialization, they are developed and evaluated on clean, dense, synthetic datasets (limiting their generalizability to real-world industrial scenarios). This paper introduces R3PM-Net, a lightweight, global-aware, object-level point matching network designed to bridge this gap by prioritizing both generalizability and real-time efficiency. To support this transition, two datasets, Sioux-Cranfield and Sioux-Scans, are proposed. They provide an evaluation ground for registering imperfect photogrammetric and event-camera scans to digital CAD models, and have been made publicly available. Extensive experiments demonstrate that R3PM-Net achieves competitive accuracy with unmatched speed. On ModelNet40, it reaches a perfect fitness score of 1 and inlier RMSE of 0.029 cm in only 0.007 s, approximately 7 times faster than the state-of-the-art method RegTR. This performance carries over to the Sioux-Cranfield dataset, maintaining a fitness of 1 and inlier RMSE of 0.030 cm with similarly low latency. Furthermore, on the highly challenging Sioux-Scans dataset, R3PM-Net successfully resolves edge cases in under 50 ms. These results confirm that R3PM-Net offers a robust, high-speed solution for critical industrial applications, where precision and real-time performance are indispensable. The code and datasets are available at this https URL.
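
The abstract reports a fitness score and an inlier RMSE without defining them. A common convention (used, for example, by Open3D's registration evaluation) takes fitness as the fraction of source points that have a target point within a distance threshold, and inlier RMSE as the root-mean-square distance over those matched points. The brute-force sketch below assumes that convention, which may differ from the paper's exact protocol; the toy point clouds and the threshold are made up.

```python
import math


def registration_metrics(source, target, max_dist):
    """Return (fitness, inlier_rmse) for an already-aligned source cloud.

    fitness     = fraction of source points with a target point within max_dist
    inlier_rmse = RMSE of nearest-neighbour distances over those inliers
    """
    inlier_sq_dists = []
    for p in source:
        nearest_sq = min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in target)
        if nearest_sq <= max_dist ** 2:
            inlier_sq_dists.append(nearest_sq)
    fitness = len(inlier_sq_dists) / len(source)
    if inlier_sq_dists:
        rmse = math.sqrt(sum(inlier_sq_dists) / len(inlier_sq_dists))
    else:
        rmse = 0.0
    return fitness, rmse


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# The last source point is far from every target point, so it counts as an outlier.
source = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0), (5.0, 5.0, 5.0)]
fitness, rmse = registration_metrics(source, target, max_dist=0.1)
```

Under this definition a fitness of 1 means every source point found a match, so the two numbers should always be read together with the threshold that was used.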

128. 【2604.05039】ID-Sim: An Identity-Focused Similarity Metric

Link: https://arxiv.org/abs/2604.05039

Authors: Julia Chae,Nicholas Kolkin,Jui-Hsien Wang,Richard Zhang,Sara Beery,Cusuh Ham

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: highly similar identities, similar identities, easily distinguishing, viewpoints or lighting, remarkable selective sensitivity

Comments: SB and CH equal advising; Project page [this https URL](https://juliachae.github.io/id_sim.github.io/)

Abstract:Humans have remarkable selective sensitivity to identities -- easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

129. 【2604.05015】Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Link: https://arxiv.org/abs/2604.05015

Authors: Chaoyou Fu,Haozhi Yuan,Yuhao Dong,Yi-Fan Zhang,Yunhang Shen,Xiaoxing Hu,Xueying Li,Jinsen Su,Chengwu Long,Xiaoyao Xie,Yongkang Xie,Xiawu Zheng,Xue Yang,Haoyu Cao,Yunsheng Wu,Ziwei Liu,Xing Sun,Caifeng Shan,Ran He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inflated leaderboard scores, real-world model capabilities, video understanding, increasingly saturated, rapid advancement

Comments: Homepage: [this https URL](https://video-mme-v2.netlify.app/)

Abstract:With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.

130. 【2604.05014】StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Link: https://arxiv.org/abs/2604.05014

Authors: StarVLA Community

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Building generalist embodied, requires integrating perception, generalist embodied agents, embodied agents requires, agents requires integrating

Comments: Open-source VLA infra, Technical Report

Abstract:Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone/action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at this https URL.

131. 【2604.04997】Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

Link: https://arxiv.org/abs/2604.04997

Authors: Rong Lu,Hao Liu,Song Hou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: geoscience technical documents, classifying geoscience technical, technical documents, work presents, presents a comparative

Comments: Accepted at the IMAGE'25 Workshop (PCW-11), Society of Exploration Geophysicists (SEG). Published version available at [this https URL](https://doi.org/10.1190/image2025-w11-03.1)

Abstract:This work presents a comparative analysis of embedding-based and generative models for classifying geoscience technical documents. Using a multi-disciplinary benchmark dataset, we evaluated the trade-offs between model accuracy, stability, and computational cost. We find that generative Vision-Language Models (VLMs) like Qwen2.5-VL, enhanced with Chain-of-Thought (CoT) prompting, achieve superior zero-shot accuracy (82%) compared to state-of-the-art multimodal embedding models like QQMM (63%). We also demonstrate that while supervised fine-tuning (SFT) can improve VLM performance, it is sensitive to training data imbalance.

132. 【2604.04972】RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.04972

Authors: Jianwei Zhang,Chaoning Zhang,Sihan Cao,Wang Liu,Pengcheng Zheng,Jiaxin Huang,Caiyan Qin,Yalan Ye,Wei Dong,Yang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, visual tokens processed, prohibitive inference costs, inference costs due

Abstract:Large Vision-Language Models (LVLMs) suffer from prohibitive inference costs due to the massive number of visual tokens processed by the language decoder. Existing pruning methods often lead to significant performance degradation because the irreversible removal of visual tokens causes a distribution shift in the hidden states that deviates from the pre-trained full-token regime. To address this, we propose Representation Consistency Pruner, which we refer to as RCP, as a novel framework that integrates cumulative visual token pruning with a delayed repair mechanism. Specifically, we introduce a cross-attention pruner that leverages the intrinsic attention of the LLM as a baseline to predict cumulative masks, ensuring consistent and monotonic token reduction across layers. To compensate for the resulting information loss, we design a delayed repair adapter denoted as DRA, which caches the essence of pruned tokens and applies FiLM-based modulation specifically to the answer generation tokens. We employ a repair loss to match the first and second-order statistics of the pruned representations with a full-token teacher. RCP is highly efficient because it trains only lightweight plug-in modules while allowing for physical token discarding at inference. Extensive experiments on LVLM benchmarks demonstrate that RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop, and outperforms prior methods that avoid fine-tuning the original model on several widely used benchmarks.
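
The "cumulative mask" property mentioned in this abstract, where a visual token dropped at one layer stays dropped at every later layer, can be illustrated with a short sketch. How RCP actually predicts its keep decisions is not described here, so the per-layer decisions below are hypothetical inputs; the code only demonstrates the monotonic accumulation.

```python
def cumulative_masks(per_layer_keep):
    """Turn independent per-layer keep decisions into monotonic cumulative masks.

    A token survives at layer l only if it survived every earlier layer,
    so the kept set can only shrink with depth.
    """
    masks = []
    alive = [True] * len(per_layer_keep[0])
    for keep in per_layer_keep:
        alive = [a and k for a, k in zip(alive, keep)]
        masks.append(alive)
    return masks


# Hypothetical keep decisions for 5 visual tokens over 3 pruning layers.
per_layer_keep = [
    [True, True, False, True, True],
    [True, False, True, True, True],   # token 2 was already dropped and stays dropped
    [True, True, True, False, True],
]
masks = cumulative_masks(per_layer_keep)
kept_counts = [sum(m) for m in masks]
```

Because the masks are nested, pruned tokens can be physically discarded from the sequence at inference rather than merely masked out.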

133. 【2604.04953】Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

Link: https://arxiv.org/abs/2604.04953

Authors: Abhishek Dharmaratnakar,Srivaths Ranganathan,Debanshu Das,Anushree Sinha

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: profound paradigm shift, Large Language Models, heuristic-based extraction methods, Multimodal Large Language, Large Language

Comments: 7 pages, 3 figures, accepted in WSDM 2026

Abstract:The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI's Sora and Google's Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.

134. 【2604.05347】CI-ICM: Channel Importance-driven Learned Image Coding for Machines

Link: https://arxiv.org/abs/2604.05347

Authors: Yun Zhang,Junle Liu,Huan Zhang,Zhaoqing Pan,Gangyi Jiang,Weisi Lin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Traditional human vision-centric, Traditional human, machine vision centric, machine vision, Importance-driven learned Image

Abstract:Traditional human vision-centric image compression methods are suboptimal for machine vision-centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI-ICM. This work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.
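
The importance-driven, non-uniform grouping described in this abstract can be sketched as a sort followed by an uneven split. The importance scores, the group sizes, and the function name below are assumptions chosen for illustration; the abstract does not state how CI-ICM sizes its groups or how it scales each group's dynamic range.

```python
def group_channels(importance, group_sizes):
    """Sort channel indices by descending importance and split them into groups."""
    if sum(group_sizes) != len(importance):
        raise ValueError("group sizes must cover every channel exactly once")
    order = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    groups, start = [], 0
    for size in group_sizes:
        groups.append(order[start:start + size])
        start += size
    return groups


# Hypothetical importance scores for 7 feature channels.
importance = [0.10, 0.90, 0.30, 0.05, 0.70, 0.20, 0.60]
# Non-uniform split: a small group for the most important channels, larger groups after.
groups = group_channels(importance, group_sizes=[1, 2, 4])
```

A codec could then spend more bits on the small leading groups and quantize the large trailing group more coarsely.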