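This blog post lists the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

540 papers were updated today, including:

  • Natural language processing: 81
  • Information retrieval: 11
  • Computer vision: 120

Natural Language Processing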

1. 【2604.03216】BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Link: https://arxiv.org/abs/2604.03216

Authors: Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, David A. Clifton

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, Behavioral Alignment Score, BAS, produce confident

Comments: 24 pages, 7 figures, 6 tables

Abstract: Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-k confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
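To make the answer-or-abstain idea concrete, here is a minimal sketch. It is not the paper's BAS definition: the penalty `-t / (1 - t)` for a wrong answer at risk threshold `t`, the uniform threshold grid, and both function names are assumptions, chosen so that answering has non-negative expected utility exactly when the true probability of being correct is at least `t`.

```python
def answer_or_abstain_utility(confidence, correct, threshold):
    """Utility of a single decision at risk threshold t, with 0 < t < 1.

    Assumed utility model (illustrative only, not taken from the paper):
    abstaining scores 0, a correct answer scores +1, and a wrong answer
    scores -t / (1 - t).
    """
    if confidence < threshold:
        return 0.0  # the model abstains at this threshold
    if correct:
        return 1.0
    return -threshold / (1.0 - threshold)


def mean_utility_over_thresholds(confidences, corrects, num_thresholds=99):
    """Average realized utility over an evenly spaced grid of thresholds."""
    thresholds = [(i + 1) / (num_thresholds + 1) for i in range(num_thresholds)]
    total = 0.0
    for t in thresholds:
        for confidence, correct in zip(confidences, corrects):
            total += answer_or_abstain_utility(confidence, correct, t)
    return total / (len(thresholds) * len(confidences))
```

Under this assumed utility, a wrong answer given with high confidence is penalized at many more thresholds, and more heavily, than a wrong answer given with low confidence, which mirrors the asymmetry the abstract describes.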

2. 【2604.03199】Learning the Signature of Memorization in Autoregressive Language Models

Link: https://arxiv.org/abs/2604.03199

Authors: David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: reference calibration, prior membership inference, hand-crafted heuristics, designer intuition, membership inference

Comments: Preprint. 10 pages, 4 figures, 12 tables

Abstract: All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at this https URL.
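For readers unfamiliar with the hand-crafted baselines the abstract mentions, the sketch below illustrates two of them over per-token log-probabilities: a mean log-likelihood score (the basis of loss thresholding) and a Min-K%-style score that averages only the least likely tokens. This is not LT-MIA; the function names, the default `k`, and the decision threshold are assumptions for illustration.

```python
def mean_logprob_score(token_logprobs):
    """Average token log-probability; higher suggests the text is more familiar."""
    return sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens."""
    n = max(1, round(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def predict_member(score, threshold):
    """Flag a sample as a training member when its score reaches the threshold.

    The threshold is a free parameter that would have to be tuned on data
    with known membership.
    """
    return score >= threshold
```

Both scores reduce a whole sequence to one number chosen by the designer, which is the limitation the abstract contrasts with a learned classifier over per-token statistics.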

3. 【2604.03192】Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

Link: https://arxiv.org/abs/2604.03192

Authors: Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin, Niloy Farhan, Farig Yousuf Sadeque

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: study multiteacher knowledge, low resource abstractive, Capacity Proportional Divergence, Proportional Divergence Preservation, Entropy Weighted Agreement

Abstract:We study multiteacher knowledge distillation for low resource abstractive summarization from a reliability aware perspective. We introduce EWAD (Entropy Weighted Agreement Aware Distillation), a token level mechanism that routes supervision between teacher distillation and gold supervision based on inter teacher agreement, and CPDP (Capacity Proportional Divergence Preservation), a geometric constraint on the student position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross lingual pseudo label KD across ten languages retains 71-122 percent of teacher ROUGE L at 3.2x compression. A human validated multi judge LLM evaluation further reveals calibration bias in single judge pipelines. Overall, our results show that reliability aware distillation helps characterize when multi teacher supervision improves summarization and when data scaling outweighs loss engineering.

4. 【2604.03180】PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

Link: https://arxiv.org/abs/2604.03180

Authors: Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: Precision-Informed Semantic Modeling, propose Precision-Informed Semantic, rich representations captured, semantic clustering methods, latent semantic clustering

Comments: To appear in Proceedings of the ACM Web Conference 2026 (WWW 26)

Abstract: In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from some corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within some narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of the efficacy of sampling strategies to improve local geometry for cluster separability; and (iii) an effective approach for web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.

5. 【2604.03174】Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.03174

Authors: Prakhar Bansal, Shivangi Agarwal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, encode vast world, vast world knowledge, remain fundamentally limited, finite context windows

Comments: 7 pages, 4 tables

Abstract:Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.

6. 【2604.03173】Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Link: https://arxiv.org/abs/2604.03173

Authors: Delip Rao, Eric Wong, Chris Callison-Burch

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, research agents supply, deep research agents, supply citation URLs

Comments: 25 pages

Abstract: Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3-13% of citation URLs are hallucinated (they have no record in the Wayback Machine and likely never existed), while 5-18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by 6-79× to under 1%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.
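The stale-versus-hallucinated distinction in the abstract can be sketched as a small decision rule. This is not the urlhealth API: the names below are hypothetical, and the two boolean inputs (whether the URL resolves now, and whether an archive such as the Wayback Machine has any record of it) are assumed to have been obtained by some other means.

```python
from enum import Enum


class UrlStatus(Enum):
    LIVE = "live"                  # the URL resolves today
    STALE = "stale"                # dead now, but an archive record exists (link rot)
    HALLUCINATED = "hallucinated"  # dead, and no archive record was found


def classify_url(resolves, has_archive_record):
    """Classify one citation URL from two precomputed observations."""
    if resolves:
        return UrlStatus.LIVE
    if has_archive_record:
        return UrlStatus.STALE
    return UrlStatus.HALLUCINATED


def rate(statuses, wanted):
    """Fraction of URLs whose status is in the `wanted` set."""
    return sum(status in wanted for status in statuses) / len(statuses)
```

With this split, the non-resolving rate counts both `STALE` and `HALLUCINATED` URLs, while the hallucination rate counts only the latter, which is why the two ranges reported in the abstract differ.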

7. 【2604.03159】BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

Link: https://arxiv.org/abs/2604.03159

Authors: Delip Rao, Chris Callison-Burch

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)

Keywords: Large language models, Large language, scientific publishing agents, publishing agents, pervasive field-level errors

Comments: 37 pages

Abstract:Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
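The gap between field-level accuracy and the share of fully correct entries follows from how the two numbers are aggregated. The sketch below shows that arithmetic with exact string matching; the function name, the exact-match rule, and the field list passed in are assumptions, and the paper's version-aware scoring is not reproduced here.

```python
def score_entries(predicted, reference, fields):
    """Return (field-level accuracy, fraction of fully correct entries).

    `predicted` and `reference` are parallel lists of dicts that map a
    BibTeX field name to its string value.
    """
    correct_fields = 0
    total_fields = 0
    fully_correct = 0
    for pred, ref in zip(predicted, reference):
        matches = [pred.get(field) == ref.get(field) for field in fields]
        correct_fields += sum(matches)
        total_fields += len(matches)
        fully_correct += all(matches)  # True counts as 1
    return correct_fields / total_fields, fully_correct / len(reference)
```

A single wrong field lowers field-level accuracy only slightly but removes the whole entry from the fully correct count, so the second number can be far lower than the first.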

8. 【2604.03147】Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Link: https://arxiv.org/abs/2604.03147

Authors: Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: language model representations, large language model, present a method, method to identify, large language

Abstract:We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA components via ridge regression on the model's self-reported valence-arousal scores. The resulting VA subspace exhibits circular geometry consistent with established models of human emotion perception. Projections along our recovered VA subspace correlate with human-crowdsourced VA ratings across 44k lexical items. Furthermore, steering generation along these axes produces monotonic shifts in the corresponding affective dimensions of model outputs. Steering along these directions also induces near-monotonic bidirectional control over refusal and sycophancy: increasing arousal decreases refusal and increases sycophancy, and vice versa. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, demonstrating cross-architecture generality. We provide a mechanistic account for these effects and prior emotionally-framed controls: refusal-associated tokens ("I can't," "sorry") occupy low-arousal, negative-valence regions, so VA steering directly modulates their emission probability.

9. 【2604.03144】InCoder-32B-Thinking: Industrial Code World Model for Thinking

Link: https://arxiv.org/abs/2604.03144

Authors: Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Tuney Zheng, Fanglin Xu, Weicheng Gu, Lin Jing, Yaxin Du, Joseph Li, Yizhi Li, Yan Xing, Chuan Hao, Ran Tao, Ruihao Gong, Aishan Liu, Zhoujun Li, Mingjie Tang, Chenghua Lin, Siheng Chen, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv

Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: embedded systems lacks, systems lacks expert, lacks expert reasoning, Industrial software development, chip design

Abstract:Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-turn dialogue with environmental error feedback, explicitly modeling the error-correction process. ICWM is trained on domain-specific execution traces from Verilog simulation, GPU profiling, etc., learns the causal dynamics of how code affects hardware behavior, and enables self-verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% in CAD-Coder and 38.0% on KernelBench) shows InCoder-32B-Thinking achieves top-tier open-source results across all this http URL Optimization

10. 【2604.03141】Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

Link: https://arxiv.org/abs/2604.03141

Authors: Nazanin Jafari, James Allan, Mohit Iyyer

Categories: Computation and Language (cs.CL)

Keywords: fine-grained factual statements, large language models, large language, Evaluating, long-form output generated

Abstract:Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
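The difference between precision over generated claims and importance-weighted recall over reference facts can be illustrated as follows. The abstract does not specify the weighting formula, so the single non-negative weight per reference fact and both function names are assumptions.

```python
def claim_precision(claim_supported):
    """Fraction of atomic claims in the response that are supported."""
    return sum(claim_supported) / len(claim_supported)


def weighted_recall(reference_facts):
    """Importance-weighted share of reference facts covered by the response.

    `reference_facts` is a list of (weight, covered) pairs. The weight
    stands in for an importance score; how it is derived from relevance
    and salience is left open here.
    """
    total_weight = sum(weight for weight, _ in reference_facts)
    covered_weight = sum(weight for weight, covered in reference_facts if covered)
    return covered_weight / total_weight
```

For example, a response whose three claims are all supported has precision 1.0, yet if it covers only one reference fact of weight 3 and misses two of weight 1, its weighted recall is 0.6 and its unweighted recall is about 0.33.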

11. 【2604.03136】StoryScope: Investigating idiosyncrasies in AI fiction

Link: https://arxiv.org/abs/2604.03136

Authors: Jenna Russell, Rishanth Rajendhran, Mohit Iyyer, John Wieting

Categories: Computation and Language (cs.CL)

Keywords: increasingly prevalent, narrative, stories, narrative features, features

Abstract: As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each written by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots while human stories frame protagonists' choices as more morally ambiguous and have increased temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.

12. 【2604.03128】Self-Distilled RLVR

Link: https://arxiv.org/abs/2604.03128

Authors: Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: LLM community, popular training paradigm, On-policy distillation, OPD

Comments: Work in progress

Abstract: On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

13. 【2604.03127】Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Link: https://arxiv.org/abs/2604.03127

Authors: Jinsook Lee, Kirk Vanacore, Zhuqian Zhou, Bakhtawar Ahtisham, Rene F. Kizilcec

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: sufficient domain grounding, Automated annotation, domain grounding, high-stakes task, fail without sufficient

Comments: 20 pages, 20 tables, 4 figures

Abstract: Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's κ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines (κ = 0.275-0.413 and 0.160-0.410). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.
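Since the results above are reported as Cohen's κ, a short sketch of the standard two-annotator formula may help when reading them: κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function name and the toy labels in the usage note are illustrative and not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Raises ZeroDivisionError when chance agreement is exactly 1, which
    happens when both annotators use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # independently, given their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For instance, with `labels_a = ["x", "x", "y", "y"]` and `labels_b = ["x", "y", "y", "y"]`, observed agreement is 0.75 and chance agreement is 0.5, so κ is 0.5.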

14. 【2604.03121】An Independent Safety Evaluation of Kimi K2.5

Link: https://arxiv.org/abs/2604.03121

Authors: Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary, Yernat Yestekov, Zora Che, Mosh Levy, Elle Najt, Dennis Murphy, Prashant Kulkarni, Lev McKinney, Kei Nishimura-Gasparian, Ram Potham, Aengus Lynch, Michael L. Chen

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM that rivals, rivals closed models, Kimi, open-weight LLM, rivals closed

Abstract:Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying safety evaluation. In this work, we conduct a preliminary safety assessment of Kimi K2.5 focusing on risks likely to be exacerbated by powerful open-weight models. Specifically, we evaluate the model for CBRNE misuse risk, cybersecurity risk, misalignment, political censorship, bias, and harmlessness, in both agentic and non-agentic settings. We find that Kimi K2.5 shows similar dual-use capabilities to GPT 5.2 and Claude Opus 4.5, but with significantly fewer refusals on CBRNE-related requests, suggesting it may uplift malicious actors in weapon creation. On cyber-related tasks, we find that Kimi K2.5 demonstrates competitive cybersecurity performance, but it does not appear to possess frontier-level autonomous cyberoffensive capabilities such as vulnerability discovery and exploitation. We further find that Kimi K2.5 shows concerning levels of sabotage ability and self-replication propensity, although it does not appear to have long-term malicious goals. In addition, Kimi K2.5 exhibits narrow censorship and political bias, especially in Chinese, and is more compliant with harmful requests related to spreading disinformation and copyright infringement. Finally, we find the model refuses to engage in user delusions and generally has low over-refusal rates. While preliminary, our findings highlight how safety risks exist in frontier open-weight models and may be amplified by the scale and accessibility of open-weight releases. Therefore, we strongly urge open-weight model developers to conduct and release more systematic safety evaluations required for responsible deployment.

15. 【2604.03110】Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

Link: https://arxiv.org/abs/2604.03110

Authors: Zihe Liu, Yulong Mao, Jinan Xu, Xinrui Peng, Kaiyu Huang

Categories: Computation and Language (cs.CL)

Keywords: Multi-aspect Knowledge Distillation, Knowledge distillation, language model compression, effective technique, technique for pre-trained

Abstract:Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution among layers, which may cause the loss of fine-grained information in the alignment process. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich language knowledge information at different aspects. Experimental results demonstrate that MaKD can achieve competitive performance compared with various strong baselines with the same storage parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.

16. 【2604.03098】Co-Evolution of Policy and Internal Reward for Language Agents

Link: https://arxiv.org/abs/2604.03098

Authors: Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language model, remains fundamentally bottlenecked, long-horizon training remains, training remains fundamentally, Large language

Comments: 20 pages, 13 figures

Abstract: Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.

17. 【2604.03081】Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

Link: https://arxiv.org/abs/2604.03081

Authors: Yubin Qu, Yi Liu, Tongcheng Geng, Gelei Deng, Yuekang Li, Leo Yu Zhang, Ying Zhang, Lei Ma

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: mandatory security review, LLM-based coding agents, coding agents extend, LLM-based coding, security review

Abstract: LLM-based coding agents extend their capabilities via third-party agent skills distributed through open marketplaces without mandatory security review. Unlike traditional packages, these skills are executed as operational directives with system-level privileges, so a single malicious skill can compromise the host. Prior work has not examined whether supply-chain attacks can directly hijack an agent's action space, such as file writes, shell commands, and network requests, despite existing safeguards. We introduce Document-Driven Implicit Payload Execution (DDIPE), which embeds malicious logic in code examples and configuration templates within skill documentation. Because agents reuse these examples during normal tasks, the payload executes without explicit prompts. Using an LLM-driven pipeline, we generate 1,070 adversarial skills from 81 seeds across 15 MITRE ATT&CK categories. Across four frameworks and five models, DDIPE achieves 11.6% to 33.5% bypass rates, while explicit instruction attacks achieve 0% under strong defenses. Static analysis detects most cases, but 2.5% evade both detection and alignment. Responsible disclosure led to four confirmed vulnerabilities and two fixes.

18. 【2604.03058】Verbalizing LLMs' assumptions to explain and control sycophancy

Link: https://arxiv.org/abs/2604.03058

Authors: Myra Cheng, Isabel Sieh, Humishka Zope, Sunny Yu, Lujain Ibrahim, Aryaman Arora, Jared Moore, Desmond Ong, Dan Jurafsky, Diyi Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: providing genuine assessment, genuine assessment, Verbalized Assumptions, assumptions, providing genuine

Abstract: LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.

19. 【2604.03057】Querying Structured Data Through Natural Language Using Language Models

Link: https://arxiv.org/abs/2604.03057

Authors: Hontan Valentin-Micu, Bunea Andrei-Alexandru, Tantaroudas Nikolaos Dimitrios, Popovici Dan-Matei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Unlike Retrieval Augmented, achieves high accuracy, achieve high precision, model achieves high, large proprietary LLMs

Notes: in publication

Abstract:This paper presents an open-source methodology for allowing users to query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model, DeepSeek R1 Distill 8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen-location scenarios, demonstrating both robust generalization and reliable query generation. Our results highlight that small, domain-specific models can achieve high precision for this task without relying on large proprietary LLMs, making this methodology suitable for resource-constrained environments and adaptable to broader multi-dataset systems.

20. 【2604.03044】JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

Link: https://arxiv.org/abs/2604.03044

Authors: Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue, Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So, Liang Huang, Ming Ke, Mingyang Li, Panfeng Shi, Peng Hao, Qi Wang, Qian Lai, Qiaoqiao Yuan, Qingyu Yin, Qiong Cao, Qixiang Wang, Rongcheng Bian, Rongduo Han, Shaoqiang Zheng, Shi Hu, Shi Suo, Shijie Ren, Shijin Zhang, Shiying Fan, Shuai Xie, Tianyi Zhang, Wei Liu, Wentao Tan, Xianghan Meng, Xiaodong He, Xing Pan, Xiran Wang, Xuyang Peng, Ya Zhang, Yang Liu, Yangyang Duan, Yanxu Chen, Yicheng Gong, Yidan Huang, Yifei Liu, Yinhao Bai, Yongqiang Liu, Yuesong Zhang, Yuqi Zhang, Zerui Xie, Zhenfang Wang, Zhennan Shen, Zheyuan Liu, Zhuwei Zeng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: language model designed, Direct Preference Optimization, JoyAI-LLM Flash, introduce JoyAI-LLM Flash, JoyAI-LLM Flash strategically

Notes: Xiaodong He is the corresponding author

Abstract:We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances thinking and non-thinking cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.

21. 【2604.03004】R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Link: https://arxiv.org/abs/2604.03004

Authors: Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu, Xuanyu Lei, Ming Yan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: dramatically improved large, improved large language, writing remains unexplored, large language models, domains like mathematics

Notes: 31 pages

Abstract:While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

22. 【2604.02986】Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Link: https://arxiv.org/abs/2604.02986

Authors: Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: true quality plateaus, learned proxy reward, reward hacking, human feedback, true quality

Notes: 27 pages, 7 figures

Abstract:Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.

23. 【2604.02985】Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Link: https://arxiv.org/abs/2604.02985

Authors: Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: specifically RAG systems, retrieved passages lead, passages lead large, RAG systems, specifically RAG

Notes: Accepted at ECIR 2026 (Full Paper)

Abstract:With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
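
The break-even reasoning in this abstract can be illustrated with a toy cost model. This is only a sketch and not the authors' profiler: the function names, the assumption that compression has a fixed cost plus a per-token cost, the assumption that inference time grows linearly with prompt length, and every number used are hypothetical.

```python
from typing import Optional


def latency_without_compression(prompt_tokens: int, infer_s_per_token: float) -> float:
    """Baseline: the LLM processes the full prompt (assumed linear cost)."""
    return prompt_tokens * infer_s_per_token


def latency_with_compression(
    prompt_tokens: int,
    keep_ratio: float,
    compress_fixed_s: float,
    compress_s_per_token: float,
    infer_s_per_token: float,
) -> float:
    """Compression overhead plus inference on the shortened prompt."""
    compress_time = compress_fixed_s + prompt_tokens * compress_s_per_token
    infer_time = prompt_tokens * keep_ratio * infer_s_per_token
    return compress_time + infer_time


def break_even_tokens(
    keep_ratio: float,
    compress_fixed_s: float,
    compress_s_per_token: float,
    infer_s_per_token: float,
) -> Optional[float]:
    """Prompt length above which compression is faster, or None if it never is."""
    # Net time saved per prompt token once the compressor's per-token cost is paid.
    saving_per_token = (1.0 - keep_ratio) * infer_s_per_token - compress_s_per_token
    if saving_per_token <= 0:
        return None  # the compressor costs more per token than it saves
    return compress_fixed_s / saving_per_token
```

With the assumed values `keep_ratio=0.5`, `compress_fixed_s=0.2`, `compress_s_per_token=0.0001`, and `infer_s_per_token=0.0004`, compression only helps for prompts longer than about 2,000 tokens. Below that, the compressor's fixed cost dominates, which mirrors the operating window described above.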

24. 【2604.02972】NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Link: https://arxiv.org/abs/2604.02972

Authors: Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang, Guojie Song

Categories: Computation and Language (cs.CL)

Keywords:

Notes:

None

25. 【2604.02967】FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Link: https://arxiv.org/abs/2604.02967

Authors: Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, Guojie Song

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent Large Reasoning, Recent Large, exhibiting human-like patterns, demonstrated remarkable success, exploring multiple alternative

Notes:

Abstract:Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% to 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.

26. 【2604.02965】Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

Link: https://arxiv.org/abs/2604.02965

Authors: Zihua Wang, Zhitao Lin, Ruibo Li, Yu Zhang, Xu Yang, Siya Mi, Xiu-Shen Wei

Categories: Robotics (cs.RO); Computation and Language (cs.CL)

Keywords: large foundation models, shown strong performance, foundation models, manipulation tasks, large foundation

Notes: Under Review

Abstract:Vision-Language-Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, their performance comes at high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open-loop execution. Although effective for reducing computation, open-loop execution is sensitive to environmental changes and prone to error accumulation due to the lack of closed-loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV-VLA), a framework that combines efficient open-loop long-horizon planning with lightweight closed-loop online verification. Specifically, SV-VLA uses a heavy VLA as a low-frequency macro-planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed-loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments. Code is available: this https URL.
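
The plan-then-verify loop described in this abstract can be sketched on a toy one-dimensional control problem. This is a minimal illustration under strong assumptions, not the SV-VLA implementation: `plan_chunk`, `reference_action`, `run_episode`, the fixed `tolerance`, and the scalar state are all hypothetical stand-ins, and the verifier here ignores the planning context that the paper conditions on.

```python
from typing import Callable, List, Tuple

GOAL = 10.0  # hypothetical target position for the toy task


def clip(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def plan_chunk(obs: float, horizon: int = 4) -> List[float]:
    """Stand-in for the heavy macro-planner: one action repeated over a chunk."""
    return [clip(GOAL - obs)] * horizon


def reference_action(obs: float) -> float:
    """Stand-in for the lightweight verifier's closed-loop reference action."""
    return clip(GOAL - obs)


def run_episode(
    step_env: Callable[[float, float], float],
    obs: float,
    max_steps: int,
    tolerance: float = 0.5,
) -> Tuple[float, int]:
    """Execute chunks open-loop, replanning only when the verifier disagrees."""
    chunk: List[float] = []
    planner_calls = 0
    for _ in range(max_steps):
        needs_plan = not chunk  # the previous chunk is exhausted
        if not needs_plan:
            # Verification: compare the next planned action with the reference.
            needs_plan = abs(chunk[0] - reference_action(obs)) > tolerance
        if needs_plan:
            chunk = plan_chunk(obs)
            planner_calls += 1
        action = chunk.pop(0)
        obs = step_env(obs, action)
    return obs, planner_calls
```

In an undisturbed environment the planner is called once per chunk; if the state is pushed away from what the chunk assumed, the mismatch with the reference action triggers an early replan.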

27. 【2604.02954】LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.02954

Authors: Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou, Longhao Yang, Lingfei Ren, Xin Yang, Xiao Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Graph-based Retrieval-Augmented Generation, Language Models, Large Language, Graph-based Retrieval-Augmented

Notes:

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface-level text semantics. To exploit this vulnerability, we propose LogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, LogicPoison employs a type-preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query-specific reasoning bridges for severing essential multi-hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface-level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that LogicPoison successfully bypasses GraphRAG's defenses, significantly degrading performance and outperforming state-of-the-art baselines in both effectiveness and stealth. Our code is available at this https URL.

28. 【2604.02951】How Annotation Trains Annotators: Competence Development in Social Influence Recognition

Link: https://arxiv.org/abs/2604.02951

Authors: Maciej Markiewicz, Beata Bajcar, Wiktoria Mieleszczenko-Kowszewicz, Aleksander Szczęsny, Tomasz Adamczyk, Grzegorz Chodak, Karolina Ostrowska, Aleksandra Sawczuk, Jolanta Babiak, Jagoda Szklarczyk, Przemysław Kazienko

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Human data annotation, Human data, objective reference, Human, Large Language Model

Notes: Accepted to AIED 2026 (27th Conference on Artificial Intelligence in Education)

Abstract:Human data annotation, especially when involving experts, is often treated as an objective reference. However, many annotation tasks are inherently subjective, and annotators' judgments may evolve over time. This study investigates changes in the quality of annotators' work from a competence perspective during a process of social influence recognition. The study involved 25 annotators from five different groups, including both experts and non-experts, who annotated a dataset of 1,021 dialogues with 20 social influence techniques, along with intentions, reactions, and consequences. An initial subset of 150 texts was annotated twice - before and after the main annotation process - to enable comparison. To measure competence shifts, we combined qualitative and quantitative analyses of the annotated data, semi-structured interviews with annotators, self-assessment surveys, and Large Language Model training and evaluation on the comparison dataset. The results indicate a significant increase in annotators' self-perceived competence and confidence. Moreover, observed changes in data quality suggest that the annotation process may enhance annotator competence and that this effect is more pronounced in expert groups. The observed shifts in annotator competence have a visible impact on the performance of LLMs trained on their annotated data.

29. 【2604.02926】A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary

Link: https://arxiv.org/abs/2604.02926

Authors: K. Skibin, M. Pozhidaev, S. Suschenko

Categories: Computation and Language (cs.CL)

Keywords: Russian language, article proposes, solve the problem, word vectors includes, morphological tagging

Notes: 8 pages, 1 figure, submitted to AINL-2026

Abstract:The article proposes a new architecture based on Multi-head attention to solve the problem of morphological tagging for the Russian language. The preprocessing of the word vectors includes splitting the words into subtokens, followed by a trained procedure for aggregating the vectors of the subtokens into vectors for tokens. This makes it possible to support an open dictionary and to analyze morphological features taking into account parts of words (prefixes, endings, etc.). The open dictionary will make it possible to analyze words that are absent from the training dataset. The computational experiment performed on the SynTagRus and Taiga datasets shows that for some grammatical categories the proposed architecture reaches an accuracy of 98-99% and above, which outperforms previously known results. For nine out of ten words, the architecture precisely predicts all grammatical categories and indicates when the categories must not be analyzed for the word. At the same time, the model based on the proposed architecture can be trained on consumer-level graphics accelerators, retains all the advantages of Multi-head attention over RNNs (RNNs are not used in the proposed approach), does not require pretraining on large collections of unlabeled texts (like BERT), and shows higher processing speed than previous results.

30. 【2604.02923】Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus

Link: https://arxiv.org/abs/2604.02923

Authors: Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, language processing tasks, natural language processing, achieved remarkable capabilities, Large Language

Notes: 13 pages, 8 figures, technical report

Abstract:Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently suffer from hallucinations -- generating plausible but factually incorrect content -- and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results with ablation studies.
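
The three-phase pipeline in this abstract (triage, parallel generation, consensus synthesis) can be sketched as follows. This is a hedged illustration only: the word-count triage rule, the majority-vote synthesis, and the names `triage`, `majority_synthesis`, and `council` are assumptions standing in for the paper's triage classifier and dedicated consensus model, and the experts are plain callables rather than real LLM clients.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Expert = Callable[[str], str]


def triage(query: str) -> str:
    """Toy complexity router: long queries go to the full council."""
    return "complex" if len(query.split()) > 8 else "simple"


def majority_synthesis(answers: List[str]) -> Dict[str, object]:
    """Stand-in consensus step: report the majority answer and any dissent."""
    counts = Counter(answers)
    consensus, votes = counts.most_common(1)[0]
    return {
        "consensus": consensus,
        "agreement": votes / len(answers),
        "dissent": sorted(a for a in counts if a != consensus),
    }


def council(
    query: str,
    experts: List[Expert],
    synthesize: Callable[[List[str]], Dict[str, object]] = majority_synthesis,
) -> Dict[str, object]:
    # Phase 1: triage decides how many experts to consult.
    selected = experts if triage(query) == "complex" else experts[:1]
    # Phase 2: query the selected experts in parallel.
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        answers = list(pool.map(lambda expert: expert(query), selected))
    # Phase 3: synthesize agreement and disagreement into one result.
    return synthesize(answers)
```

Simple queries skip the council and cost a single model call, while complex ones pay for every expert in exchange for an explicit record of agreement and dissent.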

31. 【2604.02910】Analysis of Optimality of Large Language Models on Planning Problems

Link: https://arxiv.org/abs/2604.02910

Authors: Bernd Bohnet, Michael C. Mozer, Kevin Swersky, Wil Cunningham, Aaron Parisi, Kathleen Kenealy, Noah Fiedel

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Classic AI planning, Language Model, plan efficiency

Notes:

Abstract:Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with a focus of recent benchmarks on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.

32. 【2604.02904】BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition

Link: https://arxiv.org/abs/2604.02904

Authors: Wazir Ali, Adeeb Noor, Sanaullah Mahar, Alia, Muhammad Mazhar Younas

Categories: Computation and Language (cs.CL)

Keywords: Named Entity Recognition, Biomedical Urdu Named, Urdu Named Entity, Entity Recognition, crawling health-related articles

Notes:

Abstract:In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native annotators with familiarity in the medical domain participated in the annotation process using the Doccano text annotation tool and annotated 153K tokens. Following annotation, the proposed BioUNER dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, thereby validating the dataset as gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT (mBERT), and XLM-RoBERTa. The gold-standard BioUNER dataset serves as a reliable benchmark and a valuable addition to Urdu language processing resources.

33. 【2604.02881】One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

Link: https://arxiv.org/abs/2604.02881

Authors: Baban Gain, Asif Ekbal, Trilok Nath Singh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: original training data, accessing original training, combines independently fine-tuned, independently fine-tuned models, merging combines independently

Notes:

Abstract:Weight-space model merging combines independently fine-tuned models without accessing original training data, offering a practical alternative to joint training. While merging succeeds in multitask settings, its behavior in multilingual contexts remains poorly understood. We systematically study weight-space merging for multilingual machine translation by fully fine-tuning a language model on large-scale bilingual corpora and evaluating standard merging strategies. Our experiments reveal that merging degrades performance, especially when target languages differ. To explain this failure, we analyze internal representations using span-conditioned neuron selectivity and layer-wise centered kernel alignment. We find that language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers remain largely shared across languages. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation. These findings suggest that multilingual fine-tuning may reshape geometry in ways that reduce compatibility with standard weight-space merging assumptions. Our work thus provides an explanation for why merging fails in multilingual translation scenarios.

34. 【2604.02866】LLM-based Atomic Propositions help weak extractors: Evaluation of a Propositioner for triplet extraction

Link: https://arxiv.org/abs/2604.02866

Authors: Luc Pommeret (STL), Thomas Gerald (LISN), Patrick Paroubek (STL), Sahar Ghannay (STL), Christophe Servan (STL, AMIAD), Sophie Rosset (LISN, STL)

Categories: Computation and Language (cs.CL)

Keywords: Knowledge Graph construction, requires extracting structured, Graph construction, extracting structured triplets, information-dense sentences

Notes:

Abstract:Knowledge Graph construction from natural language requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate if the decomposition of text into atomic propositions (minimal, semantically autonomous units of information) can improve the triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers entity recall losses while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.

35. 【2604.02830】GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics

Link: https://arxiv.org/abs/2604.02830

Authors: Yujing Wang, Yuanbang Liang, Yukun Lai, Hainan Zhang, Hanqi Yan

Categories: Computation and Language (cs.CL)

Keywords: deploying responsible LLMs, model internal knowledge, responsible LLMs, sufficient to correctly, fundamental challenge

Notes:

Abstract:Detecting whether a model's internal knowledge is sufficient to correctly answer a given question is a fundamental challenge in deploying responsible LLMs. In addition to verbalising the confidence by LLM self-report, more recent methods explore the model internals, such as the hidden states of the response tokens to capture how much knowledge is activated. We argue that such activated knowledge may not align with what the query requires, e.g., capturing the stylistic and length-related features that are uninformative for answering the query. To fill the gap, we propose GRADE (Gradient Dynamics for knowledge gap detection), which quantifies the knowledge gap via the cross-layer rank ratio of the gradient to that of the corresponding hidden state subspace. This is motivated by the property of gradients as estimators of the required knowledge updates for a given target. We validate GRADE on six benchmarks, demonstrating its effectiveness and robustness to input perturbations. In addition, we present a case study showing how the gradient chain can generate interpretable explanations of knowledge gaps for long-form answers.

36. 【2604.02819】Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection

Link: https://arxiv.org/abs/2604.02819

Authors: Chaoqun He, Yingfa Chen, Chaojun Xiao, Xu Han, Lijie Wen

Categories: Computation and Language (cs.CL)

Keywords: Large reasoning models, models remains challenging, models achieve strong, smaller models remains, achieve strong performance

Notes: 17 pages, 6 figures

Abstract:Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but directly transferring such reasoning processes to smaller models remains challenging. A key difficulty is that not all teacher-generated reasoning trajectories are suitable for student learning. Existing approaches typically rely on post-hoc filtering, selecting trajectories after full generation based on heuristic criteria. However, such methods cannot control the generation process itself and may still produce reasoning paths that lie outside the student's learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs generation-time selection. Instead of passively consuming complete trajectories, the student evaluates candidate continuations during the teacher's sampling process, guiding the expansion of only learnable reasoning paths and enabling early pruning of unhelpful branches. Experiments on mathematical reasoning benchmarks demonstrate that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, with improvements of around 5.9 points over Standard KD and up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and learnable reasoning trajectories, highlighting the importance of incorporating supervision during generation for effective distillation.

37. 【2604.02795】Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

Link: https://arxiv.org/abs/2604.02795

Authors: Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: aligning Large Language, Rubric-based Reinforcement Learning, Large Language Models, Reinforcement Learning, Large Language

Notes:

Abstract:Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.

38. 【2604.02784】EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Link: https://arxiv.org/abs/2604.02784

Authors: Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: input image, remain vulnerable, factually incorrect, incorrect or ungrounded, Vision-Language Models

Comments

Click to view abstract

Abstract:Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.

39. 【2604.02778】When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs

Link: https://arxiv.org/abs/2604.02778

Authors: Linyu Li, Zhi Jin, Yichi Zhang, Dongming Jin, Yuanpeng He, Haoran Duan, Gadeng Luosang, Nyima Tashi

Categories: Computation and Language (cs.CL)

Keywords: Real-world multimodal knowledge, knowledge graph reasoning, Real-world multimodal, multimodal knowledge graph, multimodal knowledge

Comments

Click to view abstract

Abstract:Real-world multimodal knowledge graphs (MMKGs) are dynamic, with new entities, relations, and multimodal knowledge emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit multimodal signals from new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, however, usually assume static graphs and suffer catastrophic forgetting as graphs evolve. To address this gap, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG employs a multimodal-structural collaborative curriculum to schedule progressive learning based on the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism to mitigate forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge via multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.

40. 【2604.02772】Multiple-Debias: A Full-process Debiasing Method for Multilingual Pre-trained Language Models

Link: https://arxiv.org/abs/2604.02772

Authors: Haoyu Liang, Peijian Zeng, Wentao Huang, Aimin Yang, Dong Zhou

Categories: Computation and Language (cs.CL)

Keywords: Pre-trained Language Models, natural language processing, Multilingual Pre-trained Language, Language Models, Multilingual Pre-trained

Comments

Click to view abstract

Abstract:Multilingual Pre-trained Language Models (MPLMs) have become essential tools for natural language processing. However, they often exhibit biases related to sensitive attributes such as gender, race, and religion. In this paper, we introduce a comprehensive multilingual debiasing method named Multiple-Debias to address these issues across multiple languages. By incorporating multilingual counterfactual data augmentation and multilingual Self-Debias across both pre-processing and post-processing stages, alongside parameter-efficient fine-tuning, we significantly reduced biases in MPLMs across three sensitive attributes in four languages. We also extended CrowS-Pairs to German, Spanish, Chinese, and Japanese, validating our full-process multilingual debiasing method for gender, racial, and religious bias. Our experiments show that (i) multilingual debiasing methods surpass monolingual approaches in effectively mitigating biases, and (ii) integrating debiasing information from different languages notably improves the fairness of MPLMs.

41. 【2604.02729】IndustryCode: A Benchmark for Industry Code Generation

Link: https://arxiv.org/abs/2604.02729

Authors: Puyu Zeng, Zhaoxi Wang, Zhixu Duan, Liang Feng, Shaobo Wang, Cunxiang Wang, Jinghang Wang, Bing Zhao, Hu Wei, Linfeng Zhang

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, finding widespread application, comprehension by Large, Large Language, general code generation

Comments: 37 pages, 28 figures, 4 tables. Includes appendix

Click to view abstract

Abstract: Code generation and comprehension by Large Language Models (LLMs) have emerged as core drivers of industrial intelligence and decision optimization, finding widespread application in fields such as finance, automation, and aerospace. Although recent advancements have demonstrated the remarkable potential of LLMs in general code generation, existing benchmarks are mainly confined to single domains and languages. Consequently, they fail to effectively evaluate the generalization capabilities required for real-world industrial applications or to reflect the coding proficiency demanded by complex industrial scenarios. To bridge this gap, we introduce IndustryCode, the first comprehensive benchmark designed to span multiple industrial domains and programming languages. IndustryCode comprises 579 sub-problems derived from 125 primary industrial challenges, accompanied by rigorous problem descriptions and test cases. It covers a wide range of fields, including finance, automation, aerospace, and remote sensing, and incorporates diverse programming languages such as MATLAB, Python, C++, and Stata. In our evaluation, the top-performing model, Claude 4.5 Opus, achieved an overall accuracy of 68.1% on sub-problems and 42.5% on main problems. The benchmark dataset and automated evaluation code will be made publicly available upon acceptance.

42. 【2604.02718】Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

Link: https://arxiv.org/abs/2604.02718

Authors: Patrick Pynadath, Jiaxin Shi, Ruqi Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: exciting recent progress, recent progress, Diffusion language, diffusion language modeling, exciting recent

Comments

Click to view abstract

Abstract:Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT-2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity's sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at this https URL.
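
The decomposition mentioned in the abstract can be sketched with the standard KL identity. The notation here is an assumption for illustration, not taken from the paper: $p$ is the distribution of samples from the model under evaluation, $q$ is the reference distribution used for scoring, and $L$ is the sequence length in tokens.

```latex
\begin{aligned}
\mathrm{KL}(p \,\|\, q)
  &= \mathbb{E}_{x \sim p}\big[\log p(x) - \log q(x)\big] \\
  &= \underbrace{\mathbb{E}_{x \sim p}\big[-\log q(x)\big]}_{\text{cross-entropy } = \; L \cdot \log(\text{generative perplexity})}
     \; - \; \underbrace{H(p)}_{\text{entropy of the samples}}
\end{aligned}
```

Read this way, a model can lower its generative perplexity simply by producing lower-entropy samples, which is why the two quantities are more informative when reported together than when either is reported alone.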

43. 【2604.02713】Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

Link: https://arxiv.org/abs/2604.02713

Authors: Jiawen Deng, Wentao Zhang, Ziyun Jiao, Fuji Ren

Categories: Computation and Language (cs.CL)

Keywords: increasingly deployed, ethically sensitive interactions, ethically sensitive, ethically sensitive behaviors, static safety checks

Comments: 22 pages, ACM CHI 2026

Click to view abstract

Abstract:Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and how do these affect dialogue quality? To stress-test chatbot performance, we develop a persona-conditioned user simulator capable of engaging in multi-turn dialogue with psychological personas and staged emotional pacing. Our analysis reveals that mainstream models exhibit recurrent breakdowns that intensify as emotional trajectories escalate. We identify several common failure patterns, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs where empathy supersedes or undermines responsibility. We organize these patterns into a taxonomy and discuss the design implications, highlighting the necessity to maintain ethical coherence and affective sensitivity throughout dynamic interactions. The study offers the HCI community a new perspective on the diagnosis and improvement of conversational AI in value-sensitive and emotionally charged contexts.

44. 【2604.02709】Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Link: https://arxiv.org/abs/2604.02709

Authors: Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: formal reasoning capabilities, automated software engineering, advancing automated software, formal reasoning, reasoning capabilities

Comments: Work in progress

Click to view abstract

Abstract:The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.

45. 【2604.02699】Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

Link: https://arxiv.org/abs/2604.02699

Authors: Rodney Jehu-Appiah

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: selectively altered reasoning, structural signature tied, previous study reported, selectively altered, vocabulary was removed

Comments: 19 pages, 10 tables, 3 appendices

Click to view abstract

Abstract:A previous study reported that E-Prime (English without the verb "to be") selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication with active controls to test the proposed mechanism: cognitive restructuring through specific vocabulary-cognition mappings. The experiment tested five conditions (unconstrained control, E-Prime, No-Have, elaborated metacognitive prompt, neutral filler-word ban) across six models and seven reasoning tasks (N=15,600 trials, 11,919 after compliance filtering). Every prediction from the cognitive restructuring hypothesis was disconfirmed. All four treatments outperformed the control (83.0%), including both active controls predicted to show null effects. The neutral filler-word ban, banning words like "very" and "just" with no role in logical inference, produced the largest improvement (+6.7 pp), while E-Prime produced the smallest (+3.7 pp). The four conditions ranked in perfect inverse order of theoretical depth. The cross-model correlation signature did not replicate (mean r=0.005). These results are consistent with a simpler mechanism: any constraint that forces a model off its default generation path acts as an output regularizer, improving reasoning by disrupting fluent but shallow response patterns. The shallowest constraints work best because they impose monitoring load with minimal conceptual disruption. I present these findings as a case study in discovery through disconfirmation.

46. 【2604.02669】Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Link: https://arxiv.org/abs/2604.02669

Authors: Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi

Categories: Computation and Language (cs.CL)

Keywords: bias, Abstract, model, language model, castes

Comments

Click to view abstract

Abstract: How biased is a language model? The answer depends on how you ask. A model that refuses to choose between castes for a leadership role will, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with lack of hygiene. Single-task benchmarks miss this because they capture only one slice of a model's bias profile. We introduce a hierarchical taxonomy covering 9 bias types, including under-studied axes like caste, linguistic, and geographic bias, operationalized through 7 evaluation tasks that span explicit decision-making to implicit association. Auditing 7 commercial and open-weight LLMs with ~45K prompts, we find three systematic patterns. First, bias is task-dependent: models counter stereotypes on explicit probes but reproduce them on implicit ones, with Stereotype Score divergences up to 0.43 between task types for the same model and identity groups. Second, safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups, but freely associate positive traits with privileged ones. Third, under-studied bias axes show the strongest stereotyping across all models, suggesting alignment effort tracks benchmark coverage rather than harm severity. These results demonstrate that single-benchmark audits systematically mischaracterize LLM bias and that current alignment practices mask representational harm rather than mitigating it.

47. 【2604.02668】Too Polite to Disagree: Understanding Sycophancy Propagation in Multi-Agent Systems

Link: https://arxiv.org/abs/2604.02668

Authors: Vira Kasprova, Amruta Parulekar, Abdulrahman AlRabah, Krishna Agaram, Ritwik Garg, Sagar Jha, Nimet Beyza Bozdag, Dilek Hakkani-Tur

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: Large language models, Large language, model opinion, agreement with user, language models

Comments

Click to view abstract

Abstract:Large language models (LLMs) often exhibit sycophancy: agreement with user stance even when it conflicts with the model's opinion. While prior work has mostly studied this in single-agent settings, it remains underexplored in collaborative multi-agent systems. We ask whether awareness of other agents' sycophancy levels influences discussion outcomes. To investigate this, we run controlled experiments with six open-source LLMs, providing agents with peer sycophancy rankings that estimate each peer's tendency toward sycophancy. These rankings are based on scores calculated using various static (pre-discussion) and dynamic (online) strategies. We find that providing sycophancy priors reduces the influence of sycophancy-prone peers, mitigates error-cascades, and improves final discussion accuracy by an absolute 10.5%. Thus, this is a lightweight, effective way to reduce discussion sycophancy and improve downstream accuracy.

48. 【2604.02660】SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models

Link: https://arxiv.org/abs/2604.02660

Authors: Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, increasingly power decision-making, power decision-making systems, Large Language, increasingly power

Comments

Click to view abstract

Abstract: As Large Language Models (LLMs) increasingly power decision-making systems across critical domains, understanding and mitigating their biases becomes essential for responsible AI deployment. Although bias assessment frameworks have proliferated for attributes such as race and gender, socioeconomic status bias remains significantly underexplored despite its widespread implications in the real world. We introduce SocioEval, a template-based framework for systematically evaluating socioeconomic bias in foundation models through decision-making tasks. Our hierarchical framework encompasses 8 themes and 18 topics, generating 240 prompts across 6 class-pair combinations. We evaluated 13 frontier LLMs on 3,120 responses using a rigorous three-stage annotation protocol, revealing substantial variation in bias rates (0.42%-33.75%). Our findings demonstrate that bias manifests differently across themes (lifestyle judgments show 10$\times$ higher bias than education-related decisions) and that deployment safeguards effectively prevent explicit discrimination but show brittleness to domain-specific stereotypes. SocioEval provides a scalable, extensible foundation for auditing class-based bias in language models.

49. 【2604.02650】Revealing the Learning Dynamics of Long-Context Continual Pre-training

Link: https://arxiv.org/abs/2604.02650

Authors: Yupu Liang, Shuang Chen, Guanwei Zhang, Shaolei Wang, Suncong Zheng

Categories: Computation and Language (cs.CL)

Keywords: Long-Context Continual Pre-training, Continual Pre-training, Existing studies, Long-Context Continual, studies on Long-Context

Comments

Click to view abstract

Abstract:Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of dozens of billions of tokens are insufficient for industrial-grade LLMs' LCCP (e.g., Hunyuan-A13B reaches saturation after training over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report "fake saturation" early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLM.

50. 【2604.02645】Speaking of Language: Reflections on Metalanguage Research in NLP

Link: https://arxiv.org/abs/2604.02645

Authors: Nathan Schneider, Antonios Anastasopoulos

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: work aims, aims to shine, shine a spotlight, Abstract, NLP and LLMs

Comments

Click to view abstract

Abstract:This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs' metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.

51. 【2604.02640】Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

Link: https://arxiv.org/abs/2604.02640

Authors: Kenichirou Narita, Siqi Peng, Taku Fukui, Moyuru Yamada, Satoshi Munakata, Satoru Takahashi

Categories: Computation and Language (cs.CL)

Keywords: final accuracy checks, simple final accuracy, Retrieval-Augmented Generation, composite factors extending, accuracy checks

Comments: 8 pages, 3 figures. Accepted at AAAI 2026 Workshop

Click to view abstract

Abstract:Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.

52. 【2604.02637】Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training

Link: https://arxiv.org/abs/2604.02637

Authors: Qihui Fan, Min Ge, Chenyan Jia, Weiyan Shi

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, contexts at scale, large language, opinions and decisions

Comments

Click to view abstract

Abstract: As large language models (LLMs) become increasingly persuasive, there is concern that people's opinions and decisions may be influenced across various contexts at scale. Prior mitigation (e.g., AI detectors and disclaimers) largely treats people as passive recipients of AI-generated information. To provide a more proactive intervention against persuasive AI, we introduce LLMimic, a role-play-based, interactive, gamified AI literacy tutorial, where participants assume the role of an LLM and progress through three key stages of the training pipeline (pretraining, SFT, and RLHF). We conducted a $2 \times 3$ between-subjects study ($N = 274$) where participants either (1) watched an AI history video (control) or (2) interacted with LLMimic (treatment), and then engaged in one of three realistic AI persuasion scenarios: (a) charity donation persuasion, (b) malicious money solicitation, or (c) hotel recommendation. Our results show that LLMimic significantly improved participants' AI literacy ($p < .001$), reduced persuasion success across scenarios ($p < .05$), and enhanced truthfulness and social responsibility levels ($p < 0.01$) in the hotel scenario. These findings suggest that LLMimic offers a scalable, human-centered approach to improving AI literacy and supporting more informed interactions with persuasive AI.

53. 【2604.02621】Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

Link: https://arxiv.org/abs/2604.02621

Authors: Yiyang Shen, Lifu Tu, Weiran Wang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reinforcement Learning, existing approaches typically, approaches typically rely, ground truth labels, large language models

Comments

Click to view abstract

Abstract: Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.

54. 【2604.02596】An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages

Link: https://arxiv.org/abs/2604.02596

Authors: Yinhan Lu, Gaganpreet Jhajj, Chen Zhang, Anietie Andy, David Ifeoluwa Adelani

Categories: Computation and Language (cs.CL)

Keywords: In-context learning, large language models, many-shot ICL, making it promising, underrepresented in pre-training

Comments: 20 pages, 3 figures, 14 tables

Click to view abstract

Abstract:In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks from a few examples, making it promising for languages underrepresented in pre-training. Recent work on many-shot ICL suggests that modern LLMs can further benefit from larger ICL examples enabled by their long context windows. However, such gains depend on careful example selection, and the inference cost can be prohibitive for low-resource language communities. In this paper, we present an empirical study of many-shot ICL for machine translation from English into ten truly low-resource languages recently added to FLORES+. We analyze the effects of retrieving more informative examples, using out-of-domain data, and ordering examples by length. Our findings show that many-shot ICL becomes more effective as the number of examples increases. More importantly, we show that BM25-based retrieval substantially improves data efficiency: 50 retrieved examples roughly match 250 many-shot examples, while 250 retrieved examples perform similarly to 1,000 many-shot examples.
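
To make the retrieval step concrete, here is a minimal sketch of BM25-based selection of in-context examples, written with the standard library only. It is an illustration under stated assumptions rather than the paper's pipeline: the function names, whitespace tokenization, the `k1` and `b` defaults, and the placeholder example pool are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against the query (Okapi BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def retrieve_examples(source_sentence, pool, k=2):
    """Return the k (source, translation) pairs whose source side best matches the input."""
    corpus = [src.lower().split() for src, _ in pool]
    scores = bm25_scores(source_sentence.lower().split(), corpus)
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


# Hypothetical pool; the target sides are placeholders, not real translations.
pool = [
    ("The river flooded the village", "<translation 1>"),
    ("She bought bread at the market", "<translation 2>"),
    ("Heavy rain fell on the river bank", "<translation 3>"),
    ("The teacher opened the book", "<translation 4>"),
]
selected = retrieve_examples("The river is rising after the rain", pool, k=2)
```

The selected pairs would then be placed in the prompt ahead of the sentence to translate; scoring against the source side only means no target-language tokenizer is required.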

55. 【2604.02585】Mitigating LLM biases toward spurious social contexts using direct preference optimization

Link: https://arxiv.org/abs/2604.02585

Authors: Hyunji Nam, Dorottya Demszky

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: introduce harmful biases, LLMs are increasingly, high-stakes decision-making, harmful biases, introduce harmful

Comments: 26 pages

Click to view abstract

Abstract: LLMs are increasingly used for high-stakes decision-making, yet their sensitivity to spurious contextual information can introduce harmful biases. This is a critical concern when models are deployed for tasks like evaluating teachers' instructional quality, where biased assessment can affect teachers' professional development and career trajectories. We investigate model robustness to spurious social contexts using the largest publicly available dataset of U.S. classroom transcripts (NCTE) paired with expert rubric scores. Evaluating seven frontier and open-weight models across seven categories of spurious contexts -- including teacher experience, education level, demographic identity, and sycophancy-inducing framings -- we find that irrelevant contextual information can shift model predictions by up to 1.48 points on a 7-point scale, with larger models sometimes exhibiting greater sensitivity despite higher predictive accuracy. Mitigations using prompts and standard direct preference optimization (DPO) prove largely insufficient. We propose Debiasing-DPO, a self-supervised training method that pairs neutral reasoning generated from the query alone with the model's biased reasoning generated with both the query and additional spurious context. We further combine this objective with supervised fine-tuning on ground-truth labels to prevent losses in predictive accuracy. Applied to Llama 3B & 8B and Qwen 3B & 7B Instruct models, Debiasing-DPO reduces bias by 84% and improves predictive accuracy by 52% on average. Our findings from the educational case study highlight that robustness to spurious context is not a natural byproduct of model scaling and that our proposed method can yield substantial gains in both accuracy and robustness for prompt-based prediction tasks.

56. 【2604.02578】High Volatility and Action Bias Distinguish LLMs from Humans in Group Coordination

Link: https://arxiv.org/abs/2604.02578

Authors: Sahaj Singh Maini, Robert L. Goldstone, Zoran Tiganj

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)

Keywords: exhibit remarkable abilities, remarkable abilities, Group Binary Search, Humans exhibit remarkable, Binary Search

Comments

Click to view abstract

Abstract:Humans exhibit remarkable abilities to coordinate in groups. As large language models (LLMs) become more capable, it remains an open question whether they can demonstrate comparable adaptive coordination and whether they use the same strategies as humans. To investigate this, we compare LLM and human performance on a common-interest game with imperfect monitoring: Group Binary Search. In this n-player game, participants need to coordinate their actions to achieve a common objective. Players independently submit numerical values in an effort to collectively sum to a randomly assigned target number. Without direct communication, they rely on group feedback to iteratively adjust their submissions until they reach the target number. Our findings show that, unlike humans who adapt and stabilize their behavior over time, LLMs often fail to improve across games and exhibit excessive switching, which impairs group convergence. Moreover, richer feedback (e.g., numerical error magnitude) benefits humans substantially but has small effects on LLMs. Taken together, by grounding the analysis in human baselines and mechanism-level metrics, including reactivity scaling, switching dynamics, and learning across games, we point to differences in human and LLM groups and provide a behaviorally grounded diagnostic for closing the coordination gap.
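
The game mechanics described in this abstract can be illustrated with a minimal sketch. The function names, the choice of `+1/-1/0` for directional feedback, and the turn-taking strategy in `simulate` are all hypothetical simplifications for illustration. They are not the paper's protocol, and real players act independently rather than taking turns.

```python
def group_feedback(submissions, target, rich=False):
    """Feedback broadcast to every player after one round."""
    error = sum(submissions) - target
    if rich:
        return error  # signed numerical error (the "richer" feedback)
    return (error > 0) - (error < 0)  # +1 too high, -1 too low, 0 on target


def simulate(n_players, target, max_rounds=100):
    """Hypothetical turn-taking strategy: one player moves by 1 per round."""
    guesses = [0] * n_players
    for round_idx in range(max_rounds):
        direction = group_feedback(guesses, target)
        if direction == 0:
            return round_idx, guesses  # group sum hit the target
        mover = round_idx % n_players
        guesses[mover] -= direction  # step against the group error
    return None, guesses  # did not converge within max_rounds
```

Because only one player moves per round here, the group never overshoots. The excessive switching the abstract attributes to LLMs would correspond to many players reacting to the same feedback at once.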

57. 【2604.02560】Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

Link: https://arxiv.org/abs/2604.02560

Authors: Liran Ringel, Ameen Ali, Yaniv Romano

Categories: Computation and Language (cs.CL)

Keywords: Discrete diffusion language, accelerate text generation, Discrete diffusion, unmasking multiple tokens, accelerate text

Abstract:Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption, we prove this bounds the total variation distance between our parallel sampling and the model's joint. Empirically, DEMASK achieves 1.7-2.2$\times$ speedup on Dream-7B while matching or improving accuracy compared to confidence-based and KL-based baselines.
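
A rough sketch of what greedy selection under a cumulative-dependency budget could look like is given below. The abstract does not specify the ordering rule or how pairwise influences are aggregated, so visiting positions by descending confidence, summing influence in both directions, and the names `greedy_unmask`, `dependency`, and `budget` are assumptions made for illustration only.

```python
def greedy_unmask(confidence, dependency, budget):
    """Pick masked positions to unmask together.

    confidence[i]    -- score for position i (assumed ordering criterion)
    dependency[i][j] -- predicted influence of position j on position i
    budget           -- cap on the cumulative dependency of the chosen set
    """
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    selected, total = [], 0.0
    for i in order:
        # Dependency that adding i would introduce, in both directions.
        added = sum(dependency[i][j] + dependency[j][i] for j in selected)
        if total + added <= budget:
            selected.append(i)
            total += added
    return selected, total
```

With `confidence = [0.9, 0.8, 0.7]` and a strong dependency between positions 0 and 1, a small budget keeps position 0, skips position 1, and still admits position 2 if it is nearly independent of position 0.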

58. 【2604.02557】Pragmatics Meets Culture: Culturally-adapted Artwork Description Generation and Evaluation

Link: https://arxiv.org/abs/2604.02557

Authors: Lingjun Zhao, Dayeon Ki, Marine Carpuat, Hal Daumé III

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: open-ended text generation, Language models, exhibit various forms, bias in decision-making, open-ended text

Abstract:Language models are known to exhibit various forms of cultural bias in decision-making tasks, yet much less is known about their degree of cultural familiarity in open-ended text generation tasks. In this paper, we introduce the task of culturally-adapted art description generation, where models describe artworks for audiences from different cultural groups who vary in their familiarity with the cultural symbols and narratives embedded in the artwork. To evaluate cultural competence in this pragmatic generation task, we propose a framework based on culturally grounded question answering. We find that base models are only marginally adequate for this task, but, through a pragmatic speaker model, we can improve simulated listener comprehension by up to 8.2%. A human study further confirms that the model with higher pragmatic competence is rated as more helpful for comprehension by 8.0%.

59. 【2604.02554】Principled and Scalable Diversity-Aware Retrieval via Cardinality-Constrained Binary Quadratic Programming

Link: https://arxiv.org/abs/2604.02554

Authors: Qiheng Lu, Nicholas D. Sidiropoulos

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: face scalability issues, Retrieval-Augmented Generation, lack theoretical guarantees, existing methods lack, methods lack theoretical

Abstract:Diversity-aware retrieval is essential for Retrieval-Augmented Generation (RAG), yet existing methods lack theoretical guarantees and face scalability issues as the number of retrieved passages $k$ increases. We propose a principled formulation of diversity retrieval as a cardinality-constrained binary quadratic programming (CCBQP) problem, which explicitly balances relevance and semantic diversity through an interpretable trade-off parameter. Inspired by recent advances in combinatorial optimization, we develop a non-convex tight continuous relaxation and a Frank-Wolfe-based algorithm with landscape analysis and convergence guarantees. Extensive experiments demonstrate that our method consistently dominates baselines on the relevance-diversity Pareto frontier, while achieving significant speedup.
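
To make the problem shape concrete, here is a small brute-force sketch of selecting exactly $k$ passages under a relevance-minus-redundancy objective. The exact objective in the paper is not given in the abstract, so the form used here (summed relevance minus `trade_off` times summed pairwise similarity) and all names are assumptions. Exhaustive search is used only because it is easy to verify on toy inputs. It is not the Frank-Wolfe-based method the abstract describes.

```python
from itertools import combinations


def ccbqp_objective(subset, relevance, similarity, trade_off):
    """Assumed objective: total relevance minus weighted pairwise redundancy."""
    rel = sum(relevance[i] for i in subset)
    redundancy = sum(similarity[i][j] for i, j in combinations(subset, 2))
    return rel - trade_off * redundancy


def brute_force_select(relevance, similarity, k, trade_off):
    """Enumerate every size-k subset (the cardinality constraint)."""
    candidates = combinations(range(len(relevance)), k)
    return max(
        candidates,
        key=lambda s: ccbqp_objective(s, relevance, similarity, trade_off),
    )
```

Setting `trade_off` to 0 reduces the selection to plain top-$k$ by relevance, while larger values push the choice toward passages that are less similar to each other.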

60. 【2604.02537】PolyJarvis: LLM Agent for Autonomous Polymer MD Simulations

Link: https://arxiv.org/abs/2604.02537

Authors: Alexander Zhao, Achuth Chandrasekhar, Amir Barati Farimani

Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)

Keywords: All-atom molecular dynamics, execution requires specialized, requires specialized expertise, All-atom molecular, Model Context Protocol

Abstract:All-atom molecular dynamics (MD) simulations can predict polymer properties from molecular structure, yet their execution requires specialized expertise in force field selection, system construction, equilibration, and property extraction. We present PolyJarvis, an agent that couples a large language model (LLM) with the RadonPy simulation platform through Model Context Protocol (MCP) servers, enabling end-to-end polymer property prediction from natural language input. Given a polymer name or SMILES string, PolyJarvis autonomously executes monomer construction, charge assignment, polymerization, force field parameterization, GPU-accelerated equilibration, and property calculation. Validation is conducted on polyethylene (PE), atactic polystyrene (aPS), poly(methyl methacrylate) (PMMA), and poly(ethylene glycol) (PEG). Results show density predictions within 0.1-4.8% and bulk moduli within 17-24% of reference values for aPS and PMMA. PMMA glass transition temperature (Tg) (395 K) matches experiment within +10-18 K, while the remaining three polymers overestimate Tg by +38 to +47 K (vs upper experimental bounds). Of the 8 property-polymer combinations with directly comparable experimental references, 5 meet strict acceptance criteria. For cases lacking suitable amorphous-phase experimental references, agreement with prior MD literature is reported separately. The remaining Tg failures are attributable primarily to the intrinsic MD cooling-rate bias rather than agent error. This work demonstrates that LLM-driven agents can autonomously execute polymer MD workflows producing results consistent with expert-run simulations.

61. 【2604.02512】Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting

Link: https://arxiv.org/abs/2604.02512

Authors: Roland Mühlenbernd

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: increasingly exhibit human-like, Large language models, exhibit human-like patterns, Large language, Effect Size Ratio

Abstract:Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.

62. 【2604.02486】VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Link: https://arxiv.org/abs/2604.02486

Authors: Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Vision Language Models, Vision Language, Language Models, achieve impressive performance, achieve impressive

Abstract:Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline which focuses on moving visual information to the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens compared to unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.

63. 【2604.02485】Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models

Link: https://arxiv.org/abs/2604.02485

Authors: Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Confirmation bias, challenges one belief, hinders one reasoning, reasoning ability, exhibit confirmation bias

Abstract:Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counter examples) developed for humans. We find prompting LLMs with such instruction consistently decreases confirmation bias in LLMs, improving rule discovery rates from 42% to 56% on average. Lastly, we mitigate confirmation bias by distilling intervention-induced behavior into LLMs, showing promising generalization to a new task, the Blicket test. Our work shows that confirmation bias is a limitation of LLMs in hypothesis exploration, and that it can be mitigated via injecting interventions designed for humans.

64. 【2604.02460】Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Link: https://arxiv.org/abs/2604.02460

Authors: Dat Tran, Douwe Kiela

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Recent work reports, work reports strong, Recent work, increased test-time computation, multi-agent LLM systems

Abstract:Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent's effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.

65. 【2604.02459】On the Geometric Structure of Layer Updates in Deep Language Models

Link: https://arxiv.org/abs/2604.02459

Authors: Jun-Sik Yoo

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: deep language models, layer updates, tokenwise, layer, updates

Comments: 11 pages, 5 figures

Abstract:We study the geometric structure of layer updates in deep language models. Rather than analyzing what information is encoded in intermediate representations, we ask how representations change from one layer to the next. We show that layerwise updates admit a decomposition into a dominant tokenwise component and a residual that is not captured by restricted tokenwise function classes. Across multiple architectures, including Transformers and state-space models, we find that the full layer update is almost perfectly aligned with the tokenwise component, while the residual exhibits substantially weaker alignment, larger angular deviation, and significantly lower projection onto the dominant tokenwise subspace. This indicates that the residual is not merely a small correction, but a geometrically distinct component of the transformation. This geometric separation has functional consequences: approximation error under the restricted tokenwise model is strongly associated with output perturbation, with Spearman correlations often exceeding 0.7 and reaching up to 0.95 in larger models. Together, these results suggest that most layerwise updates behave like structured reparameterizations along a dominant direction, while functionally significant computation is concentrated in a geometrically distinct residual component. Our framework provides a simple, architecture-agnostic method for probing the geometric and functional structure of layer updates in modern language models.

Submission history: [v1] Thu, 2 Apr 2026 18:44:34 UTC (206 KB), from Jun-Sik Yoo

66. 【2604.02451】Skeleton-based Coherence Modeling in Narratives

Link: https://arxiv.org/abs/2604.02451

Authors: Nishit Asnani, Rohan Badlani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: excited NLP researchers, excited NLP, NLP researchers, long time, NLP

Abstract:Modeling coherence in text is a task that has excited NLP researchers for a long time. It has applications in detecting incoherent structures and helping the author fix them. There has been recent work on using neural networks to extract a skeleton from one sentence, and then using that skeleton to generate the next sentence for coherent narrative story generation. In this project, we aim to study whether the consistency of skeletons across subsequent sentences is a good metric to characterize the coherence of a given body of text. We propose a new Sentence/Skeleton Similarity Network (SSN) for modeling coherence across pairs of sentences, and show that this network performs much better than baseline similarity techniques like cosine similarity and Euclidean distance. Although skeletons appear to be promising candidates for modeling coherence, our results show that sentence-level models outperform those on skeletons for evaluating textual coherence, thus indicating that the current state-of-the-art coherence modeling techniques are going in the right direction by dealing with sentences rather than their sub-parts.

67. 【2604.02450】Do We Need Frontier Models to Verify Mathematical Proofs?

Link: https://arxiv.org/abs/2604.02450

Authors: Aaditya Naik, Guruprerana Shabadi, Rajeev Alur, Mayur Naik

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: win gold medals, settle challenging open, Advances in training, challenging open problems, enabled frontier reasoning

Comments: 21 pages, 11 figures

Abstract:Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
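
The two metrics named in this abstract can be sketched as follows. The abstract defines self-consistency only as the rate of agreement across repeated judgments on the same proof, so computing it as the fraction of agreeing pairs is one possible reading and an assumption here. The function names are likewise illustrative.

```python
from itertools import combinations


def verifier_accuracy(verdicts, human_grades):
    """Fraction of proofs where the verifier matches the human grade."""
    matches = sum(v == g for v, g in zip(verdicts, human_grades))
    return matches / len(human_grades)


def self_consistency(repeated_verdicts):
    """Fraction of agreeing pairs among repeated verdicts on one proof.

    Requires at least two verdicts; with one verdict there are no pairs.
    """
    pairs = list(combinations(repeated_verdicts, 2))
    return sum(a == b for a, b in pairs) / len(pairs)
```

Under this reading, a verifier that returns "valid" twice and "invalid" once on the same proof scores 1/3, since only one of the three verdict pairs agrees.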

68. 【2604.02423】SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

Link: https://arxiv.org/abs/2604.02423

Authors: Joy Bhalla, Kristina Gligorić

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large language models, Large language, language models exhibit, user-expressed stances, correctness or consistency

Abstract:Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior work has studied this issue and its impacts, rigorous computational linguistic metrics are needed to identify when models are being sycophantic. Here, we introduce SWAY, an unsupervised computational linguistic measure of sycophancy. We develop a counterfactual prompting mechanism to identify how much a model's agreement shifts under positive versus negative linguistic pressure, isolating framing effects from content. Applying this metric to benchmark 6 models, we find that sycophancy increases with epistemic commitment. Leveraging our metric, we introduce a counterfactual mitigation strategy teaching models to consider what the answer would be if opposite assumptions were suggested. While baseline mitigation instructing to be explicitly anti-sycophantic yields moderate reductions, and can backfire, our counterfactual CoT mitigation drives sycophancy to near zero across models, commitment levels, and clause types, while not suppressing responsiveness to genuine evidence. Overall, we contribute a metric for benchmarking sycophancy and a mitigation informed by it.

69. 【2604.02371】Internalized Reasoning for Long-Context Visual Document Understanding

Link: https://arxiv.org/abs/2604.02371

Authors: Austin Veselka

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Visual long-document understanding, performing open recipes, Visual long-document, critical for enterprise, scientific applications

Comments: 9 pages

Abstract:Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best-performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within `think` tags, gated by a `cot` control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7$\times$ larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4$\times$ fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.

70. 【2604.02368】Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Link: https://arxiv.org/abs/2604.02368

Authors: Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, pivotal challenge persists, genuine expert-level cognition, characterizing genuine expert-level

Abstract:As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with extensive clinical or industrial experience, ensuring superior ecological validity. Each task uses detailed rubrics with mostly 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.

71. 【2604.02367】Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

Link: https://arxiv.org/abs/2604.02367

Authors: Warren Johnson, Charles Lee

Categories: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL)

Keywords: requires jointly optimizing, jointly optimizing output, optimizing output quality, requires jointly, governance constraints

Comments: 23 pages, 1 figure, 9 tables. Article 8 in the TAAC Research Series. Code and data: https://github.com/micoverde/plexor-slm-frontdoor-rct

Abstract:Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783), the strongest latency-accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median, $0 marginal cost). No model meets the standalone viability criterion (≥0.85 accuracy, ≤2,000 ms P95). The cost and latency prerequisites for SLM-based routing are met; the accuracy gap of 6-8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.

72. 【2604.02362】CIPHER: Conformer-based Inference of Phonemes from High-density EEG

Link: https://arxiv.org/abs/2604.02362

Authors: Varshith Madishetty

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: Decoding speech information, scalp EEG remains, EEG remains difficult, remains difficult due, High-density EEG Representations

Abstract:Decoding speech information from scalp EEG remains difficult due to low SNR and spatial blurring. We present CIPHER (Conformer-based Inference of Phonemes from High-density EEG Representations), a dual-pathway model using (i) ERP features and (ii) broadband DDA coefficients. On OpenNeuro ds006104 (24 participants, two studies with concurrent TMS), binary articulatory tasks reach near-ceiling performance but are highly confound-vulnerable (acoustic onset separability and TMS-target blocking). On the primary 11-class CVC phoneme task under full Study 2 LOSO (16 held-out subjects), performance is substantially lower (real-word WER: ERP 0.671 ± 0.080, DDA 0.688 ± 0.096), indicating limited fine-grained discriminability. We therefore position this work as a benchmark and feature-comparison study rather than an EEG-to-text system, and we constrain neural-representation claims to confound-controlled evidence.

73. 【2604.02359】Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

Link: https://arxiv.org/abs/2604.02359

Authors: May Lynn Reese, Markela Zeneli, Mindy Ng, Jacob Haimes, Andreea Damien, Elizabeth Stade

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: General-purpose Large Language, Large Language Models, General-purpose Large, Language Models, Large Language

Comments: published at IASEAI 2026, preliminary work presented at GenAI4Health workshop at NeurIPS 2025

Abstract:General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using an LLM as an evaluator (LLM-as-a-Judge) or taking the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human} \times \text{gemini}} = 0.75$, $\kappa_{\text{human} \times \text{qwen}} = 0.68$, $\kappa_{\text{human} \times \text{kimi}} = 0.56$) and that the best judge slightly outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.
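
For readers unfamiliar with the agreement statistic reported above, the following sketch computes Cohen's $\kappa$ between two raters and forms a jury verdict by majority vote. The label encoding and function names are illustrative assumptions. The paper's actual criteria, data, and tie-handling are not described in the abstract.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


def jury_verdict(judge_labels):
    """Majority vote per item; judge_labels holds one label list per judge."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*judge_labels)]
```

A jury is scored the same way as a single judge: pass the output of `jury_verdict` as one of the two raters to `cohen_kappa`. An odd number of judges with binary labels avoids ties in the vote.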

74. 【2604.02340】Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

Link: https://arxiv.org/abs/2604.02340

Authors: Ivan Sedykh, Nikita Sorokin, Valentin Malykh

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: unlike autoregressive decoding, Recent advances, masked diffusion language, full-sequence denoising passes, sampling remains expensive

Abstract:Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. On OpenWebText, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality as measured by generative perplexity.
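As a rough illustration of the model-scheduling idea, the sketch below assigns a smaller denoiser to the early and late steps and the full model to the middle steps, then estimates the compute relative to always using the full model. The step counts, the cost ratio, and the function names are assumptions made for this example and are not taken from the paper.

```python
def build_schedule(num_steps, n_early_small, n_late_small):
    """Return one 'small' or 'large' label per denoising step."""
    if n_early_small < 0 or n_late_small < 0 or n_early_small + n_late_small > num_steps:
        raise ValueError("small-model step counts must fit within num_steps")
    n_large = num_steps - n_early_small - n_late_small
    # Small model first, large model in the sensitive middle, small model last.
    return ["small"] * n_early_small + ["large"] * n_large + ["small"] * n_late_small


def relative_flops(schedule, small_cost_ratio):
    """Cost of the scheduled sampler relative to using the large model at every step."""
    cost = sum(small_cost_ratio if name == "small" else 1.0 for name in schedule)
    return cost / len(schedule)


def run_sampler(state, schedule, denoisers):
    """Apply the scheduled denoiser at each step; denoisers maps a label to a callable."""
    for step, name in enumerate(schedule):
        state = denoisers[name](state, step)
    return state


# Hypothetical setting: 10 steps, small model costs half as much as the large one.
schedule = build_schedule(num_steps=10, n_early_small=2, n_late_small=2)
flops = relative_flops(schedule, small_cost_ratio=0.5)

# Stand-in denoisers that only record which model ran at each step.
trace = run_sampler(
    [],
    schedule,
    {"small": lambda s, t: s + ["small"], "large": lambda s, t: s + ["large"]},
)
```

Which steps can tolerate the smaller model is an empirical question; the abstract reports that the middle of the trajectory is the most sensitive, which is why the large model is kept there in this sketch.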

75. 【2604.02339】SIEVE: Sample-Efficient Parametric Learning from Natural Language

Link: https://arxiv.org/abs/2604.02339

Authors: Parth Asawa, Alexandros G. Dimakis, Matei Zaharia

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: feedback-contains rich signal, Natural language context-such, context-such as instructions, Natural language, adapting language models

Abstract:Natural language context -- such as instructions, knowledge, or feedback -- contains rich signal for adapting language models. While in-context learning provides adaptation via the prompt, parametric learning persists into model weights and can improve performance further, though it is data hungry and relies heavily on either high-quality traces or automated verifiers. We propose SIEVE, a method for sample-efficient parametric learning from natural language context that requires as few as three query examples. SIEVE uses a novel synthetic data generation pipeline, SIEVE-GEN, that leverages the insight that context is decomposable. Decomposing context allows us to generate higher quality rollouts by pairing synthetic queries with only the applicable context rather than the entirety, then using context distillation to internalize context into the model. We evaluate in reasoning settings where context is necessary, including custom domains and the RuleArena and Machine Translation from One Book tasks. Our results show that SIEVE outperforms prior context distillation methods using just three query examples, demonstrating how to achieve sample-efficient parametric learning from natural language.

76. 【2604.02338】LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning

Link: https://arxiv.org/abs/2604.02338

Authors: Md Kowsher, Haris Mansoor, Nusrat Jahan Prottasha, Ozlem Garibay, Victor Zhu, Zhengping Ji, Chen Chen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: methods combine Mixture, combine Mixture, require separate adapters, adapter-based architectures, parameter-efficient fine-tuning

Abstract:MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert, causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations, eliminating the learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.

77. 【2305.18915】Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures

Link: https://arxiv.org/abs/2305.18915

Authors: Jakob Prange, Emmanuele Chersoni

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: establish empirical lower, predicted semantic structure, attempt successful, empirical lower bounds, work we build

Comments: To appear at *SEM 2023, Toronto

Abstract:In this work we build upon negative results from an attempt at language modeling with predicted semantic structure, in order to establish empirical lower bounds on what could have made the attempt successful. More specifically, we design a concise binary vector representation of semantic structure at the lexical level and evaluate in-depth how good an incremental tagger needs to be in order to achieve better-than-baseline performance with an end-to-end semantic-bootstrapping language model. We envision such a system as consisting of a (pretrained) sequential-neural component and a hierarchical-symbolic component working together to generate text with low surprisal and high linguistic interpretability. We find that (a) dimensionality of the semantic vector representation can be dramatically reduced without losing its main advantages and (b) lower bounds on prediction quality cannot be established via a single score alone, but need to take the distributions of signal and noise into account.

78. 【2302.08150】Reanalyzing L2 Preposition Learning with Bayesian Mixed Effects and a Pretrained Language Model

Link: https://arxiv.org/abs/2302.08150

Authors: Jakob Prange, Man Ho Ivy Wong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Chinese learners' pre, English prepositions, set of Chinese, Chinese learners', understanding of English

Comments: To appear at ACL 2023, Toronto

Abstract:We use both Bayesian and neural models to dissect a data set of Chinese learners' pre- and post-interventional responses to two tests measuring their understanding of English prepositions. The results mostly replicate previous findings from frequentist analyses and newly reveal crucial interactions between student ability, task type, and stimulus sentence. Given the sparsity of the data as well as high diversity among learners, the Bayesian method proves most useful; but we also see potential in using language model probabilities as predictors of grammaticality and learnability.

79. 【2112.07874】Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling

Link: https://arxiv.org/abs/2112.07874

Authors: Jakob Prange, Nathan Schneider, Lingpeng Kong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: linguistic graph representations, improve neural language, neural language modeling, examine the extent, representations can complement

Comments: Accepted to NAACL 2022 (slight typesetting divergences to NAACL camera-ready due to TexLive 2020/2021 mismatches)

Abstract:We examine the extent to which, in principle, linguistic graph representations can complement and improve neural language modeling. With an ensemble setup consisting of a pretrained Transformer and ground-truth graphs from one of 7 different formalisms, we find that, overall, semantic constituency structures are most useful to language modeling performance -- outpacing syntactic constituency structures as well as syntactic and semantic dependency structures. Further, effects vary greatly depending on part-of-speech class. In sum, our findings point to promising tendencies in neuro-symbolic language modeling and invite future research quantifying the design choices made by different formalisms.

80. 【2604.03074】Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

Link: https://arxiv.org/abs/2604.03074

Authors: Zhennan Lin, Shuai Wang, Zhaokai Sun, Pengyuan Xie, Chuan Xie, Jie Liu, Qiang Zhang, Lei Xie

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: requires speech recognition, understanding multi-speaker conversations, multi-speaker conversations requires, conversations requires speech, Transcribing and understanding

Abstract:Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.

81. 【2604.02403】Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics

Link: https://arxiv.org/abs/2604.02403

Authors: Cristian Espinal Maya

Categories: Econometrics (econ.EM); Computation and Language (cs.CL); Methodology (stat.ME)

Keywords: Large Language Models, existing survey instruments, Large Language, Augmented Human Capital, establishes the theoretical

Comments: Working paper. 13 pages, 7 figures, 6 references. Part of the Cognitive Factor Economics research program. Code: https://github.com/Cespial/cognitive-factor-economics

Abstract:This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables -- specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI-related occupational measures span two distinct dimensions -- augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.

Information Retrieval

1. 【2604.03180】PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

Link: https://arxiv.org/abs/2604.03180

Authors: Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: Precision-Informed Semantic Modeling, propose Precision-Informed Semantic, rich representations captured, semantic clustering methods, latent semantic clustering

Comments: To appear in Proceedings of the ACM Web Conference 2026 (WWW 26)

Abstract:In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from some corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within some narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of the efficacy of sampling strategies to improve local geometry for cluster separability; and (iii) an effective approach for web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.

2. 【2604.03014】User-Aware Conditional Generative Total Correlation Learning for Multi-Modal Recommendation

Link: https://arxiv.org/abs/2604.03014

Authors: Jing Du, Zesheng Ye, Congbo Ma, Feng Liu, Flora D. Salim

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Multi-modal recommendation, visual and textual, textual descriptions, interaction-only recommenders, improve upon interaction-only

Comments: 11 pages, 7 figures, 3 tables

Abstract:Multi-modal recommendation (MMR) enriches item representations by introducing item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet dominant practices based on disentangling modality-invariant preference-driving signals from modality-specific preference-irrelevant noises are flawed. First, they assume a one-size-fits-all relevance of item content to user preferences for all users, which contradicts the user-conditional fact of preferences. Second, they optimize pairwise contrastive losses separately toward cross-modal alignment, systematically ignoring higher-order dependencies inherent when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction-guided diffusion model to perform user-aware content feature filtering, preserving only personalized features relevant to each individual user. Furthermore, to capture complete cross-modal dependencies, we optimize a tractable lower bound of the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show GTC consistently outperforms state-of-the-art, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference-driven feature filtering and total correlation optimization, confirming the ability of GTC to model user-conditional relationships in MMR tasks. The code is available at: this https URL.

3. 【2604.02988】Self-Optimizing Multi-Agent Systems for Deep Research

Link: https://arxiv.org/abs/2604.02988

Authors: Arthur Câmara, Vincent Slot, Jakub Zavrel

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: user complex information, Deep Research systems, Deep Research, system iteratively plans, Research system iteratively

Comments: Accepted at the Workshop on Conversational Search for Complex Information Needs at ECIR 2026

Abstract:Given a user's complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. We therefore explore various multi-agent optimization methods to show that enabling agents to self-play and explore different prompt combinations can produce high-quality Deep Research systems that match or outperform expert-crafted prompts.

4. 【2604.02985】Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Link: https://arxiv.org/abs/2604.02985

Authors: Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: specifically RAG systems, retrieved passages lead, passages lead large, RAG systems, specifically RAG

Comments: Accepted at ECIR 2026 (Full Paper)

Abstract:With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.

5. 【2604.02833】Bilateral Intent-Enhanced Sequential Recommendation with Embedding Perturbation-Based Contrastive Learning

Link: https://arxiv.org/abs/2604.02833

Authors: Shanfan Zhang, Yongyi Lin, Yuan Rao

Categories: Information Retrieval (cs.IR)

Keywords: Accurately modeling users', modeling users' evolving, users' evolving preferences, Accurately modeling, sequential interactions remains

Comments: 13 pages, 8 figures

Abstract:Accurately modeling users' evolving preferences from sequential interactions remains a central challenge in recommender systems. Recent studies emphasize the importance of capturing multiple latent intents underlying user behaviors. However, existing methods often fail to effectively exploit collective intent signals shared across users and items, leading to information isolation and limited robustness. Meanwhile, current contrastive learning approaches struggle to construct views that are both semantically consistent and sufficiently discriminative. In this work, we propose BIPCL, an end-to-end Bilateral Intent-enhanced, Embedding Perturbation-based Contrastive Learning framework. BIPCL explicitly integrates multi-intent signals into both item and sequence representations via a bilateral intent-enhancement mechanism. Specifically, shared intent prototypes on the user and item sides capture collective intent semantics distilled from behaviorally similar entities, which are subsequently integrated into representation learning. This design alleviates information isolation and improves robustness under sparse supervision. To construct effective contrastive views without disrupting temporal or structural dependencies, BIPCL injects bounded, direction-aware perturbations directly into structural item embeddings. On this basis, BIPCL further enforces multi-level contrastive alignment across interaction- and intent-level representations. Extensive experiments on benchmark datasets demonstrate that BIPCL consistently outperforms state-of-the-art baselines, with ablation studies confirming the contribution of each component.

6. 【2604.02690】AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis

Link: https://arxiv.org/abs/2604.02690

Authors: Teng Lin, Yuyu Luo, Nan Tang

Categories: Information Retrieval (cs.IR)

Keywords: Unstructured documents dominate, explicit organization hinders, organization hinders precise, hinders precise information, documents dominate enterprise

Abstract:Unstructured documents dominate enterprise and web data, but their lack of explicit organization hinders precise information retrieval. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, incurring high computational cost and frequent LLM calls for post-processing. To address this critical issue, we propose AnnoRetrieve, a novel retrieval paradigm that shifts from embeddings to structured annotations, enabling precise, annotation-driven semantic retrieval. Our system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, dramatically reducing LLM usage and overall cost. The system integrates two synergistic core innovations: SchemaBoot, which automatically generates document annotation schemas via multi-granularity pattern discovery and constraint-based optimization, laying a foundation for annotation-driven retrieval and eliminating manual schema design, and Structured Semantic Retrieval (SSR), the core retrieval engine, which unifies semantic understanding with structured query execution; by leveraging the annotated structure instead of vector embeddings, SSR achieves precise semantic matching, seamlessly completing attribute-value extraction, table generation, and progressive SQL-based reasoning without relying on LLM interventions. This annotation-driven paradigm overcomes the limitations of traditional vector-based methods with coarse-grained matching and heavy LLM dependency and graph-based methods with high computational overhead. Experiments on three real-world datasets confirm that AnnoRetrieve significantly lowers LLM call frequency and retrieval cost while maintaining high accuracy. AnnoRetrieve establishes a new paradigm for cost-effective, precise, and scalable document analysis through intelligent structuring.

7. 【2604.02684】MBGR: Multi-Business Prediction for Generative Recommendation at Meituan

Link: https://arxiv.org/abs/2604.02684

Authors: Changhao Li, Junwei Yin, Zhilin Zeng, Senjie Kou, Shuli Wang, Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang

Categories: Information Retrieval (cs.IR)

Keywords: Generative recommendation, Multi-Business Generative Recommendation, recently emerged, promising paradigm, paradigm for industrial

Abstract:Generative recommendation (GR) has recently emerged as a promising paradigm for industrial recommendations. GR leverages Semantic IDs (SIDs) to reduce the encoding-decoding space and employs the Next Token Prediction (NTP) framework to explore scaling laws. However, existing GR methods suffer from two critical issues: (1) a seesaw phenomenon in multi-business scenarios arises due to NTP's inability to capture complex cross-business behavioral patterns; and (2) a unified SID space causes representation confusion by failing to distinguish distinct semantic information across businesses. To address these issues, we propose Multi-Business Generative Recommendation (MBGR), the first GR framework tailored for multi-business scenarios. Our framework comprises three key components. First, we design a Business-aware semantic ID (BID) module that preserves semantic integrity via domain-aware tokenization. Then, we introduce a Multi-Business Prediction (MBP) structure to provide business-specific prediction capabilities. Furthermore, we develop a Label Dynamic Routing (LDR) module that transforms sparse multi-business labels into dense labels to further enhance the multi-business generation capability. Extensive offline and online experiments on Meituan's food delivery platform validate MBGR's effectiveness, and we have successfully deployed it in production.

8. 【2604.02617】AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

Link: https://arxiv.org/abs/2604.02617

Authors: Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li

Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: analysis requires verifying, rapidly growing literature, requires verifying complex, existing approaches fail, verifying complex technical

Comments: Winner of 2025-2026 Radiance Technologies Innovation Bowl

Abstract:Scientific and Technical Intelligence (STI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.

9. 【2604.02554】Principled and Scalable Diversity-Aware Retrieval via Cardinality-Constrained Binary Quadratic Programming

Link: https://arxiv.org/abs/2604.02554

Authors: Qiheng Lu, Nicholas D. Sidiropoulos

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: face scalability issues, Retrieval-Augmented Generation, lack theoretical guarantees, existing methods lack, methods lack theoretical

Abstract:Diversity-aware retrieval is essential for Retrieval-Augmented Generation (RAG), yet existing methods lack theoretical guarantees and face scalability issues as the number of retrieved passages $k$ increases. We propose a principled formulation of diversity retrieval as a cardinality-constrained binary quadratic programming (CCBQP), which explicitly balances relevance and semantic diversity through an interpretable trade-off parameter. Inspired by recent advances in combinatorial optimization, we develop a non-convex tight continuous relaxation and a Frank--Wolfe based algorithm with landscape analysis and convergence guarantees. Extensive experiments demonstrate that our method consistently dominates baselines on the relevance-diversity Pareto frontier, while achieving significant speedup.
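To make the optimization problem concrete, here is a minimal sketch of one plausible cardinality-constrained binary quadratic objective: pick exactly k passages to maximize a weighted relevance term minus a weighted pairwise-similarity penalty. The exact objective in the paper is not given in the abstract, so this form, the `tradeoff` parameter, and the toy scores are all assumptions. The solver is an exhaustive search that is only feasible for tiny inputs; it is not the Frank-Wolfe relaxation the authors propose.

```python
from itertools import combinations


def ccbqp_brute_force(relevance, similarity, k, tradeoff):
    """Exhaustively select k items maximizing
    tradeoff * (sum of relevance) - (1 - tradeoff) * (sum of pairwise similarity).
    """
    best_set, best_value = None, float("-inf")
    for subset in combinations(range(len(relevance)), k):
        rel = sum(relevance[i] for i in subset)
        # Redundancy: similarity summed over every unordered pair in the subset.
        red = sum(similarity[i][j] for i, j in combinations(subset, 2))
        value = tradeoff * rel - (1.0 - tradeoff) * red
        if value > best_value:
            best_set, best_value = subset, value
    return best_set, best_value


# Hypothetical toy scores: passages 0 and 1 are near-duplicates.
relevance = [0.9, 0.85, 0.5, 0.4]
similarity = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

relevance_only, _ = ccbqp_brute_force(relevance, similarity, k=2, tradeoff=1.0)
balanced, _ = ccbqp_brute_force(relevance, similarity, k=2, tradeoff=0.5)
```

With the trade-off set to relevance only, the two near-duplicate passages are selected; with a balanced trade-off, the second slot goes to a less similar passage. The number of subsets grows combinatorially with k, which is the scalability issue that motivates a continuous relaxation.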

10. 【2604.02539】Synapse: Evolving Job-Person Fit with Explainable Two-phase Retrieval and LLM-guided Genetic Resume Optimization

Link: https://arxiv.org/abs/2604.02539

Authors: Ansel Kaplan Erol, Seohee Yoon, Keenan Hom, Xisheng Zhang

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: low-relevance applicant pools, Modern recruitment platforms, severe information imbalance, rapidly changing collections, recruitment platforms operate

Abstract:Modern recruitment platforms operate under severe information imbalance: job seekers must search over massive, rapidly changing collections of postings, while employers are overwhelmed by high-volume, low-relevance applicant pools. Existing recruitment recommender systems typically rely on keyword matching or single-stage semantic retrieval, which struggle to capture fine-grained alignment between candidate experience and job requirements under real-world scale and cost constraints. We present Synapse, a multi-stage semantic recruitment system that separates high-recall candidate generation from high-precision semantic reranking, combining efficient dense retrieval using FAISS with an ensemble of contrastive learning and Large Language Model (LLM) reasoning. To improve transparency, Synapse incorporates a retrieval-augmented explanation layer that grounds recommendations in explicit evidence. Beyond retrieval, we introduce a novel evolutionary resume optimization framework that treats resume refinement as a black-box optimization problem. Using Differential Evolution with LLM-guided mutation operators, the system iteratively modifies candidate representations to improve alignment with screening objectives, without any labeled data. Evaluation shows that the proposed ensemble improves nDCG@10 by 22% over embedding-only retrieval baselines, while the evolutionary optimization loop consistently yields monotonic improvements in recommender scores, exceeding 60% relative gain across evaluated profiles. We plan to release code and data upon publication.

11. 【2604.02431】SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval

Link: https://arxiv.org/abs/2604.02431

Authors: Matthew McKee

Categories: Information Retrieval (cs.IR)

Keywords: Retrieving relevant past, relevant past interactions, long-term conversational memory, conversational memory typically, memory typically relies

Comments: 12 pages, 12 tables, 3 appendices

Abstract:Retrieving relevant past interactions from long-term conversational memory typically relies on large dense retrieval models (110M-1.5B parameters) or LLM-augmented indexing. We introduce SelRoute, a framework that routes each query to a specialized retrieval pipeline -- lexical, semantic, hybrid, or vocabulary-enriched -- based on its query type. On LongMemEval_M (Wu et al., 2024), SelRoute achieves Recall@5 of 0.800 with bge-base-en-v1.5 (109M parameters) and 0.786 with bge-small-en-v1.5 (33M parameters), compared to 0.762 for Contriever with LLM-generated fact keys. A zero-ML baseline using SQLite FTS5 alone achieves NDCG@5 of 0.692, already exceeding all published baselines on ranking quality -- a gap we attribute partly to implementation differences in lexical retrieval. Five-fold stratified cross-validation confirms routing stability (CV gap of 1.3-2.4 Recall@5 points; routes stable for 4/6 query types across folds). A regex-based query-type classifier achieves 83% effective routing accuracy, and end-to-end retrieval with predicted types (Recall@5 = 0.689) still outperforms uniform baselines. Cross-benchmark evaluation on 8 additional benchmarks spanning 62,000+ instances -- including MSDialog, LoCoMo, QReCC, and PerLTQA -- confirms generalization without benchmark-specific tuning, while exposing a clear failure mode on reasoning-intensive retrieval (RECOR Recall@5 = 0.149) that bounds the claim. We also identify an enrichment-embedding asymmetry: vocabulary expansion at storage time improves lexical search but degrades embedding search, motivating per-pipeline enrichment decisions. The full system requires no GPU and no LLM inference at query time.
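The routing step described in this abstract, a regex-based query-type classifier that picks a retrieval pipeline, could look roughly like the sketch below. The query types, patterns, and type-to-pipeline mapping are invented for illustration and are not SelRoute's actual rules; only the pipeline names (lexical, semantic, hybrid) come from the abstract.

```python
import re

# Hypothetical mapping from query type to retrieval pipeline.
ROUTES = {
    "temporal": "hybrid",
    "preference": "semantic",
    "factual": "lexical",
}

# Hypothetical patterns, checked in order; the first match decides the type.
PATTERNS = [
    ("temporal", re.compile(r"\b(when|how long|before|after|last (week|month|year))\b", re.I)),
    ("preference", re.compile(r"\b(recommend|suggest|prefer|favorite)\b", re.I)),
]


def classify(query):
    """Return the first matching query type, falling back to 'factual'."""
    for query_type, pattern in PATTERNS:
        if pattern.search(query):
            return query_type
    return "factual"


def route(query):
    """Name the retrieval pipeline that should handle this query."""
    return ROUTES[classify(query)]


temporal_route = route("When did I adopt my dog?")
preference_route = route("Can you recommend a restaurant?")
default_route = route("What is my sister's name?")
```

A rule-based classifier like this needs no model inference at query time, which is consistent with the abstract's claim that the full system runs without a GPU; the cost is that misclassified queries go to the wrong pipeline, as reflected in the gap the abstract reports between oracle and predicted query types.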

Computer Vision

1. 【2604.03231】CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Link: https://arxiv.org/abs/2604.03231

Authors: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent vision-language models, contrastive image-text objectives, Recent vision-language, typically rely, image-text objectives

Comments: 16 pages, 10 figures, 5 tables

Abstract:Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.

2. 【2604.03225】VOSR: A Vision-Only Generative Model for Image Super-Resolution

Link: https://arxiv.org/abs/2604.03225

Authors: Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Xiangtao Kong, Jixin Zhao, Shihao Wang, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: web-scale text-image data, adapting large, generative image super-resolution, rely on adapting, web-scale text-image

Comments: Accepted by CVPR2026

Abstract:Most recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, even though SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at this https URL.

3. 【2604.03212】ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

Link: https://arxiv.org/abs/2604.03212

Authors: Jiekai Wu, Rong Fu, Chuangqi Li, Zijian Zhang, Guangxin Wu, Hao Zhang, Shiyin Lin, Jianyuan Ni, Yang Li, Dongxu Zhang, Amir H. Gandomi, Simon Fong, Pengbin Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic categories emerge, acquisition conditions shift, categories emerge, shift across seasons, real deployment

Abstract:Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons, cities, and sensors. Despite recent progress, many incremental approaches still treat training steps as isolated updates, which leaves representation drift and forgetting insufficiently controlled. We present ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field. By jointly enforcing low-curvature motion and inter-class separation, ProtoFlow stabilizes prototype geometry throughout incremental learning. Experiments on standard class- and domain-incremental remote sensing benchmarks show consistent gains over strong baselines, including up to 1.5-2.0 points improvement in mIoUall, together with reduced forgetting. These results suggest that explicitly modeling temporal prototype evolution is a practical and interpretable strategy for robust continual remote sensing segmentation.

4. 【2604.03203】PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Link: https://arxiv.org/abs/2604.03203

Authors: Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal, Suzanne P.M. de Vette, Hendrike Neh, Baoqiang Ma, Peter M.A. van Ooijen, Lisanne V. van Dijk

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: computer-aided decision making, Three-dimensional medical image, medical image data, decision making, deep learning

Comments: 16 pages, 6 figures and 1 table

Abstract:Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments, we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as few as two lines of code.

5. 【2604.03198】The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge Report

Link: https://arxiv.org/abs/2604.03198

Authors: Bin Ren, Hang Guo, Yan Shu, Jiaqi Ma, Ziteng Cui, Shuhong Liu, Guofeng Mei, Lei Sun, Zongwei Wu, Fahad Shahbaz Khan, Salman Khan, Radu Timofte, Yawei Li, Hongyuan Yu, Pufan Xu, Chen Wu, Long Peng, Jiaojiao Yi, Siyang Yi, Yuning Cui, Jingyuan Xia, Xing Mou, Keji He, Jinlin Wu, Zongang Gao, Sen Yang, Rui Zheng, Fengguo Li, Yecheng Lei, Wenkai Min, Jie Liu, Keye Cao, Shubham Sharma, Manish Prasad, Haobo Li, Matin Fazel, Abdelhak Bentaleb, Rui Chen, Shurui Shi, Zitao Dai, Qingliang Liu, Yang Cheng, Jing Hu, Xuan Zhang, Rui Ding, Tingyi Zhang, Hui Deng, Mengyang Wang, Fulin Liu, Jing Wei, Qian Wang, Hongying Liu, Mingyang Li, Guanglu Dong, Zheng Yang, Chao Ren, Hongbo Fang, Lingxuan Li, Lin Si, Pan Gao, Moncef Gabbouj, Watchara Ruangsang, Supavadee Aramvith

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reviews the NTIRE, efficient single-image super-resolution, paper reviews, proposed solutions, efficient single-image

Comments: CVPR 2026 NTIRE Workshop Paper, Efficient Super Resolution Technical Report

Abstract:This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. They gauge the state-of-the-art results for efficient single-image super-resolution.

6. 【2604.03191】The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Link: https://arxiv.org/abs/2604.03191

Authors: Takuya Shiba

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: downstream manipulation performance, improve downstream manipulation, Diffusion Policy, improve Diffusion Policy, vision-language modeling

Comments: 11 pages, 1 figure

Abstract:Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
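
The fixed-capacity codebook argument can be made concrete with a back-of-the-envelope bound: a chunk of n tokens drawn from a codebook of K entries carries at most n * log2(K) bits, however rich the upstream features are. The sketch below is only an illustration of that general bound; the function name and the sizes are hypothetical and are not taken from the paper, whose formal treatment may differ.

```python
import math

def max_bits_per_action_chunk(codebook_size, tokens_per_chunk):
    """Upper bound (in bits) on the information a discrete action chunk can carry.

    Each token selects one of `codebook_size` entries, so it conveys at most
    log2(codebook_size) bits; the bound is reached only if tokens are uniform
    and independent.
    """
    return tokens_per_chunk * math.log2(codebook_size)

# Hypothetical sizes: enlarging the codebook raises the ceiling,
# while a better vision encoder alone leaves it unchanged.
for size in (256, 1024, 4096):
    print(size, max_bits_per_action_chunk(size, tokens_per_chunk=8))
```

Under this view, relaxing codebook capacity is the only change in the discrete pipeline that moves the ceiling, which is consistent with the codebook-size experiment described in the abstract.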

7. 【2604.03181】Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

Link: https://arxiv.org/abs/2604.03181

Authors: Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, Jun Guo, Nan Sun, Long Qian, Xinghang Li, Xin Xiao, Jing Liu, Nianfeng Liu, Tao Kong, Yan Huang, Liang Wang, Tieniu Tan

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing policies overlook, manipulation requires understanding, spatial structure, temporal evolution, existing policies

Comments: Project website: https://lpy1219.github.io/MV-VDP-Web/

Abstract:Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.

8. 【2604.03179】Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Link: https://arxiv.org/abs/2604.03179

Authors: Gengwei Zhang, Jie Peng, Zhen Tan, Mufan Qiu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Yanyong Zhang, Tianlong Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, post-training Multimodal Large, large reasoning models

Comments: CVPR 2026

Abstract:The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.

9. 【2604.03176】SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

Link: https://arxiv.org/abs/2604.03176

Authors: Wenfeng Zhang, Jun Ni, Yue Meng, Xiaodong Pei, Wei Hu, Qibing Qin, Lei Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: unmanned aerial vehicle, highly challenging task, primarily caused, remains a highly, Object detection

Comments: Accepted for publication in IEEE Transactions on Multimedia

Abstract:Object detection in unmanned aerial vehicle (UAV) images remains a highly challenging task, primarily caused by the complexity of background noise and the imbalance of target scales. Traditional methods easily struggle to effectively separate objects from intricate backgrounds and fail to fully leverage the rich multi-scale information contained within images. To address these issues, we have developed a synergistic feature fusion network (SFFNet) with dual-domain edge enhancement specifically tailored for object detection in UAV images. Firstly, the multi-scale dynamic dual-domain coupling (MDDC) module is designed. This component introduces a dual-driven edge extraction architecture that operates in both the frequency and spatial domains, enabling effective decoupling of multi-scale object edges from background noise. Secondly, to further enhance the representation capability of the model's neck in terms of both geometric and semantic information, a synergistic feature pyramid network (SFPN) is proposed. SFPN leverages linear deformable convolutions to adaptively capture irregular object shapes and establishes long-range contextual associations around targets through the designed wide-area perception module (WPM). Moreover, to adapt to the various applications or resource-constrained scenarios, six detectors of different scales (N/S/M/B/L/X) are designed. Experiments on two challenging aerial datasets (VisDrone and UAVDT) demonstrate the outstanding performance of SFFNet-X, achieving 36.8 AP and 20.6 AP, respectively. The lightweight models (N/S) also maintain a balance between detection accuracy and parameter efficiency. The code will be available at this https URL.

10. 【2604.03172】EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Link: https://arxiv.org/abs/2604.03172

Authors: Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Predicting product quality, multimodal item information, user interaction history, Predicting product, cold-start scenarios

Abstract:Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
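
A count-weighted Huber loss of the kind this abstract describes might look like the sketch below. The abstract does not give the weighting formula, so the log(1 + count) weights, the normalization, and all names here are assumptions made purely for illustration.

```python
import math

def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear beyond `delta`."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude
    return delta * (magnitude - 0.5 * delta)

def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Huber loss averaged with per-sample weights derived from rating counts.

    Assumption: weight = log(1 + count), so items rated by many users count
    more, with diminishing returns. The paper's actual weighting may differ.
    """
    weights = [math.log1p(count) for count in rating_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        w * huber(p - t, delta) for w, p, t in zip(weights, preds, targets)
    )
    return weighted_sum / total_weight

# Hypothetical batch: the second item has far more ratings, so its error dominates.
print(weighted_huber_loss([4.0, 3.0], [4.5, 5.0], rating_counts=[1, 500]))
```

The intent of such a scheme is that an average rating backed by hundreds of reviews is a less noisy regression target than one backed by a single review, so it is allowed to pull harder on the gradient.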

11. 【2604.03156】CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Link: https://arxiv.org/abs/2604.03156

Authors: Yuhan Pu, Hao Zheng, Ziqian Mo, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to modify, modify a source, editing, Nano Banana, image editing aims

Abstract:Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a win rate 20% higher on average than multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

12. 【2604.03134】SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

Link: https://arxiv.org/abs/2604.03134

Authors: Meihua Li, Yang Zhang, Weizhao He, Hu Qu, Yisong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: domain shifts prevalent, Few-Shot Medical Image, Medical Image Segmentation, aims to segment, addressing the critical

Comments: CVPR2026

Abstract:Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrated excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.

13. 【2604.03120】SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

Link: https://arxiv.org/abs/2604.03120

Authors: Xiaoran Zhang, Yu Liu, Jinyu Liang, Kangqiushi Li, Zhiwei Huang, Huaxin Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Unmanned Aerial Vehicles, Navigation Satellite System, Aerial Vehicles, Unmanned Aerial, Global Navigation Satellite

Comments: 15 pages, 4 figures. Submitted to IEEE J-STARS

Abstract:Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at this https URL.

14. 【2604.03118】Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Link: https://arxiv.org/abs/2604.03118

Authors: Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, Jun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: low inference budgets, extremely low inference, Distilling video generation, Distilling video, inference budgets

Comments: under review

Abstract:Distilling video generation models to extremely low inference budgets (e.g., 2-4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at this https URL.

15. 【2604.03117】Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

Link: https://arxiv.org/abs/2604.03117

Authors: Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, Wen Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, attacks remains largely, adversarial attacks remains, Infrared vision-language models, low-visibility environments

Abstract:Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.

16. 【2604.03114】Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Link: https://arxiv.org/abs/2604.03114

Authors: Zhangyun Tan, Zeliang Zhang, Susan Liang, Yolo Yunlong Tang, Lisha Chen, Chenliang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: web-scale data retain, data retain sensitive, require removing, trained on web-scale, web-scale data

Abstract:VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.

17. 【2604.03094】A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification

Link: https://arxiv.org/abs/2604.03094

Authors: David Mike-Ewewie, Panhapiseth Lim, Priyanka Kumar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Synthetic Aperture Radar, Accurate and automated, automated sea ice, sea ice classification, classification is important

Abstract:Accurate and automated sea ice classification is important for climate monitoring and maritime safety in the Arctic. While Synthetic Aperture Radar (SAR) is the operational standard because of its all-weather capability, it remains challenging to distinguish morphologically similar ice classes under severe class imbalance. Rather than claiming a fully validated multimodal system, this paper establishes a trustworthy SAR-only baseline that future fusion work can build upon. Using the AI4Arctic/ASIP Sea Ice Dataset (v2), which contains 461 Sentinel-1 scenes matched with expert ice charts, we combine full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization to evaluate Vision Transformer baselines. We compare ViT-Base models trained with cross-entropy and weighted cross-entropy against a ViT-Large model trained with focal loss. Among the tested configurations, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class. These results show that focal-loss training offers a more useful precision-recall trade-off than weighted cross-entropy for rare ice classes and establishes a cleaner baseline for future multimodal fusion with optical, thermal, or meteorological data.
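
The contrast between weighted cross-entropy and focal loss mentioned in this abstract can be seen on a single sample. The sketch below uses the standard focal loss form, -alpha * (1 - p)^gamma * log(p), where p is the predicted probability of the true class; gamma = 2 is a commonly used default and not a value reported in the abstract, and the function names are hypothetical.

```python
import math

def weighted_cross_entropy(p_true, class_weight=1.0):
    """Cross-entropy for one sample, scaled by a fixed per-class weight."""
    return -class_weight * math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: cross-entropy down-weighted by (1 - p)^gamma.

    Confident, correct predictions (p close to 1) contribute almost nothing,
    so training concentrates on hard examples regardless of their class.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# An easy sample (p = 0.9) versus a hard sample (p = 0.1).
for p in (0.9, 0.1):
    print(p, weighted_cross_entropy(p), focal_loss(p))
```

Class weighting scales every sample of a rare class by the same factor, whereas the focal term adapts per sample: with gamma = 2 the easy sample's loss is scaled by 0.01 and the hard sample's by 0.81. Setting gamma = 0 recovers plain weighted cross-entropy.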

18. 【2604.03072】MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

Link: https://arxiv.org/abs/2604.03072

Authors: Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, multimodal large language, language models, compared with text, multimodal large

Comments: 9 pages

Abstract:For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning emerges for efficient inference. Current approaches typically measure token importance based on the attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature levels. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.

19. 【2604.03069】SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction

Link: https://arxiv.org/abs/2604.03069

Authors: Zicheng Zhang, Xiangting Meng, Ke Wu, Wenchao Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent progress, Gaussian Splatting, notably improved rendering, notably improved, improved rendering quality

Comments:

Abstract:Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and feed-forward models. Extensive experimental results demonstrate that SparseSplat can achieve state-of-the-art rendering quality with only 22% of the Gaussians and maintain reasonable rendering quality with only 1.5% of the Gaussians. Project page: this https URL.

20. 【2604.03064】Gram-MMD: A Texture-Aware Metric for Image Realism Assessment

Link: https://arxiv.org/abs/2604.03064

Authors: Joé Napolitano, Pascal Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Frechet Inception Distance, generated images remains, generative modeling, remains a fundamental, fundamental challenge

Comments: 13 pages, 15 figures, 2 tables. Preprint

Abstract:Evaluating the realism of generated images remains a fundamental challenge in generative modeling. Existing distributional metrics such as the Frechet Inception Distance (FID) and CLIP-MMD (CMMD) compare feature distributions at a semantic level but may overlook fine-grained textural information that can be relevant for distinguishing real from generated images. We introduce Gram-MMD (GMMD), a realism metric that leverages Gram matrices computed from intermediate activations of pretrained backbone networks to capture correlations between feature maps. By extracting the upper-triangular part of these symmetric Gram matrices and measuring the Maximum Mean Discrepancy (MMD) between an anchor distribution of real images and an evaluation distribution, GMMD produces a representation that encodes textural and structural characteristics at a finer granularity than global embeddings. To select the hyperparameters of the metric, we employ a meta-metric protocol based on controlled degradations applied to MS-COCO images, measuring monotonicity via Spearman's rank correlation and Kendall's tau. We conduct experiments on both the KADID-10k database and the RAISE realness assessment dataset using various backbone architectures, including DINOv2, DC-AE, Stable Diffusion's VAE encoder, VGG19, and the AlexNet backbone from LPIPS, among others. We also demonstrate on a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars) that CMMD can incorrectly rank real images as less realistic than synthetic ones due to its semantic bias, while GMMD preserves the correct ordering. Our results suggest that GMMD captures complementary information to existing semantic-level metrics.
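The pipeline described in this abstract (Gram matrix of feature maps, upper-triangular extraction, then MMD between two sets of images) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the Gaussian RBF kernel, the bandwidth `sigma`, the normalization by the number of spatial positions, the biased MMD estimator, and the hand-made "feature maps" are all choices made here and are not taken from the paper, which uses activations of pretrained backbones.

```python
import math
from typing import List, Sequence

Vector = List[float]


def gram_upper(features: Sequence[Sequence[float]]) -> Vector:
    """Flatten the upper triangle of the Gram matrix of C feature maps.

    `features` has shape (C, N): C channels, each flattened to N spatial positions.
    Entry (i, j) is the dot product of channels i and j, divided by N (assumed scaling).
    """
    n = len(features[0])
    out = []
    for i in range(len(features)):
        for j in range(i, len(features)):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            out.append(dot / n)
    return out


def rbf(x: Vector, y: Vector, sigma: float) -> float:
    """Gaussian RBF kernel between two vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd2(xs: Sequence[Vector], ys: Sequence[Vector], sigma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of the squared Maximum Mean Discrepancy."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


# Toy "activations": 2 channels x 4 positions per image (made-up numbers).
anchor = [
    gram_upper([[1, 0, 1, 0], [0, 1, 0, 1]]),
    gram_upper([[1, 0, 1, 0], [0, 1, 0, 0.9]]),
]
evaluated = [
    gram_upper([[1, 1, 1, 1], [1, 1, 1, 1]]),
    gram_upper([[1, 1, 1, 0.9], [1, 1, 1, 1]]),
]
print("MMD^2 between anchor and evaluated sets:", mmd2(anchor, evaluated))
```

Only the upper triangle is kept because the Gram matrix is symmetric, so the lower triangle carries no extra information.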

21. 【2604.03061】Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

Link: https://arxiv.org/abs/2604.03061

Authors: Weixiong Sun, Xiang Yin, Chao Dong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, Nano Banana, raise the question, image restoration, Recent

Comments:

Abstract:Recent advances in generative AI raise the question of whether general-purpose image editing models can serve as unified solutions for image restoration. In this work, we conduct a systematic evaluation of Nano Banana 2 for image restoration across diverse scenes and degradation types. Our results show that prompt design plays a critical role, where concise prompts with explicit fidelity constraints achieve the best trade-off between reconstruction accuracy and perceptual quality. Compared with state-of-the-art restoration models, Nano Banana 2 achieves superior performance in full-reference metrics while remaining competitive in perceptual quality, which is further supported by user studies. We also observe strong generalization in challenging scenarios, such as small faces, dense crowds, and severe degradations. However, the model remains sensitive to prompt formulation and may require iterative refinement for optimal results. Overall, our findings suggest that general-purpose generative models hold strong potential as unified image restoration solvers, while highlighting the importance of controllability and robustness. All test results are available on this https URL.

22. 【2604.03045】STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

Link: https://arxiv.org/abs/2604.03045

Authors: Linfeng Fan, Yuan Tian, Ziwei Li, Zhiwu Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, generating visually unsupported

Comments: Preprint

Abstract:Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.

23. 【2604.03040】QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection

Link: https://arxiv.org/abs/2604.03040

Authors: Lokman Bekit, Hamza Karim, Nghia T Nguyen, Yasin Yilmaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Anomaly Detection, Video Anomaly, Anomaly Detection, computer vision, fundamental challenge

Comments:

Abstract:Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This "prompt-updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.

24. 【2604.03039】GenSmoke-GS: A Multi-Stage Method for Novel View Synthesis from Smoke-Degraded Images Using a Generative Model

Link: https://arxiv.org/abs/2604.03039

Authors: Qida Cao, Xinyuan Hu, Changyue Shi, Jiajun Ding, Zhou Yu, Jun Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper describes, Restoration and Reconstruction, smoke-degraded images, Reconstruction, smoke reduces image

Comments:

Abstract:This paper describes our method for Track 2 of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge on smoke-degraded images. In this task, smoke reduces image visibility and weakens the cross-view consistency required by scene optimization and rendering. We address this problem with a multi-stage pipeline consisting of image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, and averaging over repeated runs. The main purpose of the pipeline is to improve visibility before rendering while limiting scene-content changes across input views. Experimental results on the challenge benchmark show improved quantitative performance and better visual quality than the provided baselines. The code is available at this https URL. Our method achieved a ranking of 1 out of 14 participants in Track 2 of the NTIRE 3DRR Challenge, as reported on the official competition website: this https URL.

25. 【2604.03037】ARM: Advantage Reward Modeling for Long-Horizon Manipulation

Link: https://arxiv.org/abs/2604.03037

Authors: Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: robotic manipulation remains, provide limited guidance, sparse rewards provide, rewards provide limited, Long-horizon robotic manipulation

Comments:

Abstract:Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.

26. 【2604.03002】Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition

Link: https://arxiv.org/abs/2604.03002

Authors: Seoyeon Ko, Yeojin Song, Egene Chung, Luca Quagliato, Taeyong Lee, Junhyug Noh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gait recognizers excel, Skeleton-based gait recognizers, modeling spatial configurations, underuse explicit motion, Wavelet Feature Stream

Comments: 5 pages, 1 figure, to appear in ICASSP 2026

Abstract:Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.

27. 【2604.02996】Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

Link: https://arxiv.org/abs/2604.02996

Authors: Weiquan Wang, Jun Xiao, Feifei Shao, Yi Yang, Yueting Zhuang, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing dynamic scenes, multiple interacting humans, Reconstructing dynamic, creating high-fidelity digital, high-fidelity digital twins

Comments:

Abstract:Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.

28. 【2604.02979】Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

Link: https://arxiv.org/abs/2604.02979

Authors: Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, Wei Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: repeated multi-step denoising, models enable long-form, remain expensive due, diffusion models enable, enable long-form video

Comments:

Abstract:Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
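The idea of a three-way choice between cache, predict, and recompute, with prediction by Taylor extrapolation over the noise level, can be sketched as follows. Everything concrete here is an assumption for illustration: the abstract does not give the extrapolation order, the decision signal, or any thresholds, so `taylor_predict` uses a first-order expansion with a finite-difference derivative, and `choose_mode` with its `change`, `reuse_tol`, and `predict_tol` parameters is hypothetical.

```python
from typing import List, Sequence, Tuple

# One cached entry: (noise level, model output at that level).
Cached = Tuple[float, List[float]]


def taylor_predict(history: Sequence[Cached], t_next: float) -> List[float]:
    """First-order Taylor extrapolation from the two most recent cached outputs.

    The derivative with respect to the noise level is approximated by a
    finite difference between the last two cached entries.
    """
    (t0, y0), (t1, y1) = history[-2], history[-1]
    slope = [(b - a) / (t1 - t0) for a, b in zip(y0, y1)]
    return [b + s * (t_next - t1) for b, s in zip(y1, slope)]


def choose_mode(change: float, reuse_tol: float = 0.01, predict_tol: float = 0.1) -> str:
    """Hypothetical tri-modal rule based on an estimated relative output change."""
    if change < reuse_tol:
        return "cache"      # reuse the cached output unchanged
    if change < predict_tol:
        return "predict"    # cheap extrapolation instead of a full forward pass
    return "recompute"      # run the model


# Made-up cached outputs at noise levels 1.0 and 0.8.
history = [(1.0, [0.0, 2.0]), (0.8, [1.0, 2.0])]
print(choose_mode(0.05), taylor_predict(history, t_next=0.6))
```

The middle branch is the point of the design: it covers steps where plain reuse would be too coarse but a full forward pass is not needed.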

29. 【2604.02977】Effect of Input Resolution on Retinal Vessel Segmentation Performance: An Empirical Study Across Five Datasets

Link: https://arxiv.org/abs/2604.02977

Authors: Amarnath R

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: satisfy GPU memory, GPU memory constraints, uniform batch processing, deep learning pipelines, enable uniform batch

Comments: 12 pages, 4 figures, 3 tables

Abstract:Most deep learning pipelines for retinal vessel segmentation resize fundus images to satisfy GPU memory constraints and enable uniform batch processing. However, the impact of this resizing on thin vessel detection remains underexplored. When high-resolution images are downsampled, thin vessels are reduced to subpixel structures, causing irreversible information loss even before the data enters the network. Standard volumetric metrics such as the Dice score do not capture this loss because thick vessel pixels dominate the evaluation. We investigated this effect by training a baseline UNet at multiple downsampling ratios across five fundus datasets (DRIVE, STARE, CHASE_DB1, HRF, and FIVES) with native widths ranging from 565 to 3504 pixels, keeping all other settings fixed. We introduce a width-stratified sensitivity metric that evaluates thin (half-width < 3 pixels), medium (3 to 7 pixels), and thick (> 7 pixels) vessel detection separately, using native resolution width estimates derived from a Euclidean distance transform. Results show that for high-resolution datasets (HRF, FIVES), thin vessel sensitivity improves monotonically as images are downsampled toward the encoder's effective operating range, peaking at processed widths between 256 and 876 pixels. For low-to-mid resolution datasets (DRIVE, STARE, CHASE_DB1), thin vessel sensitivity is highest at or near native resolution and degrades with any downsampling. Across all five datasets, aggressive downsampling reduced thin vessel sensitivity by up to 15.8 percentage points (DRIVE) while Dice remained relatively stable, confirming that Dice alone is insufficient for evaluating microvascular segmentation.
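A width-stratified sensitivity of the kind described in this abstract can be sketched as below. The sketch assumes the bins are thin (half-width below 3 pixels), medium (3 to 7 pixels), and thick (above 7 pixels); the treatment of the boundary values 3 and 7 is a guess. It also takes per-pixel half-width estimates as given, whereas the paper derives them from a Euclidean distance transform. The function names and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def width_bin(half_width: float) -> str:
    """Assign a ground-truth vessel pixel to a width class (assumed bin edges)."""
    if half_width < 3:
        return "thin"
    if half_width <= 7:
        return "medium"
    return "thick"


def stratified_sensitivity(
    pixels: Iterable[Tuple[float, bool]],
) -> Dict[str, Optional[float]]:
    """Sensitivity (TP / ground-truth positives) per width class.

    `pixels` yields (half_width_px, detected) for each ground-truth vessel pixel.
    A class with no ground-truth pixels gets None.
    """
    hits = {"thin": 0, "medium": 0, "thick": 0}
    totals = {"thin": 0, "medium": 0, "thick": 0}
    for half_width, detected in pixels:
        name = width_bin(half_width)
        totals[name] += 1
        hits[name] += int(detected)
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in totals}


# Made-up ground-truth vessel pixels: (half-width in pixels, detected by the model?).
sample = [(1.0, True), (2.0, False), (5.0, True), (9.0, True), (10.0, True)]
print(stratified_sensitivity(sample))
```

In this toy sample the pooled sensitivity is 4 out of 5, while only half of the thin-vessel pixels are detected, which is the kind of gap a pooled metric hides.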

30. 【2604.02973】Exploring Motion-Language Alignment for Text-driven Motion Generation

Link: https://arxiv.org/abs/2604.02973

Authors: Ruxi Gu, Zilei Wang, Wei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: follow textual descriptions, Text-driven human motion, realistic motion sequences, synthesize realistic motion, Text-driven human

Comments: 10 pages, 8 figures

Abstract:Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.
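The abstract does not define SinkRatio, so the following is only one plausible reading of "attention concentration on the start text token", not the paper's formula: the share of each query's attention mass that lands on the first text token, averaged over queries. The function name `start_token_attention_share` and the example attention matrix are hypothetical.

```python
from typing import Sequence


def start_token_attention_share(attn: Sequence[Sequence[float]]) -> float:
    """Average fraction of attention mass placed on the first text token.

    `attn[q][k]` is the attention weight from query q (e.g. a motion token)
    to text token k. Rows are renormalized, so they need not sum exactly to 1.
    """
    shares = [row[0] / sum(row) for row in attn]
    return sum(shares) / len(shares)


# Made-up attention weights: 2 queries attending over 3 text tokens.
attn = [
    [0.70, 0.20, 0.10],
    [0.50, 0.25, 0.25],
]
print(f"share on start token: {start_token_attention_share(attn):.2f}")
```

Under this reading, uniform attention over K text tokens gives 1/K, so values well above 1/K would indicate a sink on the start token.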

31. 【2604.02966】Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Link: https://arxiv.org/abs/2604.02966

Authors: Wenhao Li, Zimeng Wu, Yu Wu, Zehua Fu, Jiaxin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned aerial vehicle, dynamically changing scenarios, limited annotated training, Unmanned aerial, annotated training data

Comments: CVPR2026 Accepted

Abstract:Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task, when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. The source code is available at this https URL.

32. 【2604.02956】Collaborative Multi-Mode Pruning for Vision-Language Models

Link: https://arxiv.org/abs/2604.02956

Authors: Zimeng Wu, Yunhong Wang, Donghao Wang, Jiaxin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unified Transformer architecture, resource-constrained devices remains, devices remains challenging, remains challenging due, high computational complexity

Comments: CVPR2026 Accepted

Abstract:Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, without fully exploring the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address the above limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs by performing joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the effect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages, while in each stage we estimate the priority of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration, in order to achieve a stable pruning process and avoid local optima. Extensive experiments across various vision-language tasks and models demonstrate that our method effectively improves performance under high pruning ratios compared to state-of-the-art approaches. The source code is available at this https URL.

33. 【2604.02948】CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation

Link: https://arxiv.org/abs/2604.02948

Authors: Zelin Zhang, Kedi Li, Huiqi Liang, Tao Zhang, Chuanzhi Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse sensing modalities, shown great potential, leveraging complementary information, sensing modalities, shown great

Comments:

Abstract:Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either use modality-specific adaptations or rely on loosely coupled interactions, thereby limiting flexibility and resulting in less effective cross-modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state-of-the-art performance with minimal additional parameters and strong generalization to unseen modality combinations.

34. 【2604.02946】Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

Link: https://arxiv.org/abs/2604.02946

Authors: Koshiro Nagano, Ryo Fujii, Ryo Hachiuma, Fumiaki Sato, Taiki Sekii, Hideo Saito

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: reducing collection costs, collection costs, training data synthesis, attracted attention, effective approach

Comments: CVPR 2026

Abstract:Learning methods using synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly through the diversification of training samples and do not explicitly teach the model which regions in the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during the training data synthesis process, indicating whether each region in the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.

35. 【2604.02941】MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion

Link: https://arxiv.org/abs/2604.02941

Authors: Bin Liu, Zhixiang Xiong, Zhifen He, Bo Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Speech-driven three-dimensional, animation synthesis aims, facial motion signals, facial animation synthesis, facial motion

Comments: 9 pages

Abstract:Speech-driven three-dimensional (3D) facial animation synthesis aims to build a mapping from one-dimensional (1D) speech signals to time-varying 3D facial motion signals. Current methods still face challenges in maintaining lip-sync accuracy and producing realistic facial expressions, primarily due to the highly ill-posed nature of this cross-modal mapping. In this paper, we introduce a novel 3D audio-driven facial animation synthesis method through multi-resolution representation and multi-modal feature fusion, called MMTalker, which can accurately reconstruct the rich details of 3D facial motion. We first achieve a continuous, detailed representation of the 3D face by mesh parameterization and non-uniform differentiable sampling. The mesh parameterization technique establishes the correspondence between the UV plane and the 3D facial mesh and is used to provide ground truth for the continuous learning. Differentiable non-uniform sampling enables precise facial detail acquisition by setting a learnable sampling probability in each triangular face. Next, we employ a residual graph convolutional network and a dual cross-attention mechanism to extract discriminative facial motion features from multiple input modalities. This proposed multimodal fusion strategy makes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of the facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the sampled points in the canonical UV space and the encoded facial motion features. Comprehensive experiments demonstrate that significant improvements are achieved over state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.

36. 【2604.02935】Modality-Specific Hierarchical Enhancement for RGB-D Camouflaged Object Detection

Link: https://arxiv.org/abs/2604.02935

Authors: Yuzhen Niu, Yangqing Wang, Ri Cheng, Fusheng Li, Rongshen Wang, Zhichen Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Camouflaged object detection, high target-background similarity, RGB-D COD methods, Camouflaged object, Hierarchical Enhancement Module

Comments: 11 pages, 7 figures, including supplementary material. Accepted by IEEE ICME 2026

Abstract:Camouflaged object detection (COD) is challenging due to high target-background similarity, and recent methods address this by complementarily using RGB-D texture and geometry cues. However, RGB-D COD methods still underutilize modality-specific cues, which limits fusion quality. We believe this is because RGB and depth features are fused directly after backbone extraction without modality-specific enhancement. To address this limitation, we propose MHENet, an RGB-D COD framework that performs modality-specific hierarchical enhancement and adaptive fusion of RGB and depth features. Specifically, we introduce a Texture Hierarchical Enhancement Module (THEM) to amplify subtle texture variations by extracting high-frequency information and a Geometry Hierarchical Enhancement Module (GHEM) to enhance geometric structures via learnable gradient extraction, while preserving cross-scale semantic consistency. Finally, an Adaptive Dynamic Fusion Module (ADFM) adaptively fuses the enhanced texture and geometry features with spatially varying weights. Experiments on four benchmarks demonstrate that MHENet surpasses 16 state-of-the-art methods qualitatively and quantitatively. Code is available at this https URL.

37. 【2604.02934】PolyReal: A Benchmark for Real-World Polymer Science Workflows

Link: https://arxiv.org/abs/2604.02934

Authors: Wanhao Liu, Weida Wang, Jiaqing Xie, Suorong Yang, Jue Wang, Benteng Chen, Guangtao Mei, Zonglin Yang, Shufei Zhang, Yuchun Mo, Lang Cheng, Jin Zeng, Houqiang Li, Wanli Ouyang, Yuqiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, Language Models

Comments: none

Abstract:Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.

38. 【2604.02930】BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

Link: https://arxiv.org/abs/2604.02930

Authors: Miguel Antunes-García, Santiago Montiel-Marín, Fabio Sánchez-García, Rodrigo Gutiérrez-Moreno, Rafael Barea, Luis M. Bergasa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous Driving systems, accurately detect, surrounding obstacles, evolve is essential, predict the behaviour

Comments: 15 pages, 5 figures

Abstract:A robust awareness of how dynamic scenes evolve is essential for Autonomous Driving systems, as they must accurately detect, track, and predict the behaviour of surrounding obstacles. Traditional perception pipelines that rely on modular architectures tend to suffer from cumulative errors and latency. Instance Prediction models provide a unified solution, performing Bird's-Eye-View segmentation and motion estimation across current and future frames using information directly obtained from different sensors. However, a key challenge in these models lies in the effective processing of the dense spatial and temporal information inherent in dynamic driving environments. This level of complexity demands architectures capable of capturing fine-grained motion patterns and long-range dependencies without compromising real-time performance. We introduce BEVPredFormer, a novel camera-only architecture for BEV instance prediction that uses attention-based temporal processing to improve temporal and spatial comprehension of the scene and relies on an attention-based 3D projection of the camera information. BEVPredFormer employs a recurrent-free design that incorporates gated transformer layers, divided spatio-temporal attention mechanisms, and multi-scale head tasks. Additionally, we incorporate a difference-guided feature extraction module that enhances temporal representations. Extensive ablation studies validate the effectiveness of each architectural component. When evaluated on the nuScenes dataset, BEVPredFormer matched or surpassed state-of-the-art methods, highlighting its potential for robust and efficient Autonomous Driving perception.

39. 【2604.02915】GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes

Link: https://arxiv.org/abs/2604.02915

Authors: Mijeong Kim, Jungtaek Kim, Bohyung Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrates Gaussian Processes, Gaussian Splatting, Gaussian Processes, dynamic scenes, framework that integrates

Comments: CVPR 2026, Page: [this https URL](https://cv.snu.ac.kr/research/GP4DGS)

Abstract:We present GP-4DGS, a novel framework that integrates Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for principled probabilistic modeling of dynamic scenes. While existing 4DGS methods focus on deterministic reconstruction, they are inherently limited in capturing motion ambiguity and lack mechanisms to assess prediction reliability. By leveraging the kernel-based probabilistic nature of GPs, our approach introduces three key capabilities: (i) uncertainty quantification for motion predictions, (ii) motion estimation for unobserved or sparsely sampled regions, and (iii) temporal extrapolation beyond observed training frames. To scale GPs to the large number of Gaussian primitives in 4DGS, we design spatio-temporal kernels that capture the correlation structure of deformation fields and adopt variational Gaussian Processes with inducing points for tractable inference. Our experiments show that GP-4DGS enhances reconstruction quality while providing reliable uncertainty estimates that effectively identify regions of high motion ambiguity. By addressing these challenges, our work takes a meaningful step toward bridging probabilistic modeling and neural graphics.

40. 【2604.02908】SentiAvatar: Towards Expressive and Interactive Digital Humans

Link: https://arxiv.org/abs/2604.02908

Authors: Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, Ruihua Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)

Keywords: building expressive interactive, digital humans, expressive interactive, create SuSu, framework for building

Comments: 19 pages, 4 figures

Abstract:We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at this https URL.

41. 【2604.02905】UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

Link: https://arxiv.org/abs/2604.02905

Authors: Geonuk Kim, Minhoi Kim, Kangil Lee, Minsu Kim, Hyeonseong Jeon, Jeonghoon Han, Hyoungjoon Lim, Junho Yim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing approaches operate, recognizing unprecedented defects, industrial inspection systems, closed-set assumption, detecting novel anomalies

Comments: Accepted to CVPR 2026

Abstract:Although industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While visual prompting offers a scalable alternative for industrial inspection, existing methods often suffer from prompt embedding collapse due to high intra-class variance and subtle inter-class differences. To resolve this, we propose UniSpector, which shifts the focus from naive prompt-to-region matching to the principled design of a semantically structured and transferable prompt topology. UniSpector employs the Spatial-Spectral Prompt Encoder to extract orientation-invariant, fine-grained representations; these serve as a solid basis for the Contrastive Prompt Encoder to explicitly regularize the prompt space into a semantically organized angular manifold. Additionally, Prompt-guided Query Selection generates adaptive object queries aligned with the prompt. We introduce Inspect Anything, the first benchmark for visual-prompt-based open-set defect localization, where UniSpector significantly outperforms baselines by at least 19.7% and 15.8% in AP50b and AP50m, respectively. These results show that our method enables a scalable, retraining-free inspection paradigm for continuously evolving industrial environments, while offering critical insights into the design of generic visual prompting.

42. 【2604.02903】RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection

Link: https://arxiv.org/abs/2604.02903

Authors: Cheng Lu, Mingqian Ji, Shanshan Zhang, Zhihao Li, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: object detection remains, making reliable context, detection remains challenging, object detection, making reliable

Comments: none

Abstract:Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. To address this issue, recent state space model (SSM)-based methods have improved long-range modeling efficiency. However, their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To overcome this limitation, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors, while introducing only modest overhead. Extensive experiments on nuScenes and Argoverse 2 demonstrate consistent improvements across strong baselines. In particular, RayMamba achieves up to 2.49 mAP and 1.59 NDS gain in the challenging 40-50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.
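
As a rough illustration of what a sector-wise, ray-aligned ordering could look like, the sketch below bins sparse voxel centers into azimuth sectors around the sensor and sorts each sector from near to far, so voxels along similar rays become neighbors in the sequence. The function name, the sector count, and the sort key are assumptions made for illustration; the abstract does not specify RayMamba's actual serialization.

```python
import math

def ray_aligned_order(voxels, num_sectors=8):
    """Return voxel indices ordered sector by sector, near to far.

    voxels: list of (x, y, z) voxel centers in the sensor frame.
    num_sectors: hypothetical number of azimuth bins (an assumption).
    """
    keyed = []
    for idx, (x, y, _z) in enumerate(voxels):
        azimuth = math.atan2(y, x) % (2.0 * math.pi)  # angle in [0, 2*pi]
        sector = min(int(azimuth / (2.0 * math.pi) * num_sectors), num_sectors - 1)
        rng = math.hypot(x, y)                         # horizontal range from sensor
        keyed.append((sector, rng, idx))
    keyed.sort()                                       # by sector, then by range
    return [idx for _, _, idx in keyed]

voxels = [(10.0, 0.1, 0.0), (-1.0, 5.0, 0.0), (2.0, 0.1, 0.0), (-3.0, -1.0, 0.0)]
order = ray_aligned_order(voxels, num_sectors=4)
```

Here the two voxels that lie almost on the same forward ray (indices 2 and 0) end up adjacent, with the nearer one first, which is the kind of neighborhood a generic space-filling order may break up.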

43. 【2604.02896】EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment

Link: https://arxiv.org/abs/2604.02896

Authors: Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Tao Zhou, Hui Li, Zhangyong Tang, Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: proper adaptation, image fusion research, image fusion, fusion, Evaluation

Comments: 20 figures, accepted by TPAMI

Abstract:Evaluation is essential in image fusion research, yet most existing metrics are directly borrowed from other vision tasks without proper adaptation. These traditional metrics, often based on complex image transformations, not only fail to capture the true quality of the fusion results but are also computationally demanding. To address these issues, we propose a unified evaluation framework specifically tailored for image fusion. At its core is a lightweight network designed to efficiently approximate widely used metrics, following a divide-and-conquer strategy. Unlike conventional approaches that directly assess similarity between fused and source images, we first decompose the fusion result into infrared and visible components. The evaluation model is then used to measure the degree of information preservation in these separated components, effectively disentangling the fusion evaluation process. During training, we incorporate a contrastive learning strategy and inform our evaluation model with perceptual scene assessments provided by a large language model. Finally, we propose the first consistency evaluation framework, which measures the alignment between image fusion metrics and human visual perception, using both independent no-reference scores and downstream task performance as objective references. Extensive experiments show that our learning-based evaluation paradigm delivers both superior efficiency (up to 1,000 times faster) and greater consistency across a range of standard image fusion benchmarks. Our code will be publicly available at this https URL.

44. 【2604.02893】Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models

Link: https://arxiv.org/abs/2604.02893

Authors: Hai Nguyen-Truong, Alper Balbay, Tunga Bayrak

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: natural language description, referred geometric element, study visual explanation, Referring Image Segmentation, natural image benchmarks

Comments: 12 pages, 7 figures

Abstract:We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to 1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
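
The abstract does not define Buffered IoU, so the sketch below only illustrates why plain IoU is harsh on thin structures. It shows standard IoU on pixel sets next to one hypothetical tolerance-based variant, in which a pixel counts as matched when it lies within a small buffer of the other mask. The `buffered_iou` formula, the square buffer, and the `radius` parameter are assumptions, not the paper's metric.

```python
def iou(pred, gt):
    """Standard IoU between two sets of (row, col) pixels."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def dilate(pixels, radius):
    """Grow a pixel set by `radius` pixels in every direction (square buffer)."""
    return {(r + dr, c + dc)
            for (r, c) in pixels
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)}

def buffered_iou(pred, gt, radius=1):
    """Hypothetical tolerance-based IoU (an assumption, not the paper's formula).

    Pixels within `radius` of the other mask count as matched; with
    radius=0 this reduces to standard IoU.
    """
    matched = (pred & dilate(gt, radius)) | (gt & dilate(pred, radius))
    union = pred | gt
    return len(matched) / len(union) if union else 1.0

# A one-pixel-wide line predicted one row below the ground truth.
gt_line = {(0, c) for c in range(5)}
pred_line = {(1, c) for c in range(5)}
plain = iou(pred_line, gt_line)
tolerant = buffered_iou(pred_line, gt_line, radius=1)
```

In this toy case the plain IoU is zero even though the prediction is off by a single pixel, while the tolerant variant gives full credit; a real metric would sit somewhere between these extremes depending on the buffer size.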

45. 【2604.02891】Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

Link: https://arxiv.org/abs/2604.02891

Authors: Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang, Zhou Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tight compute budgets, requires extracting query-relevant, extracting query-relevant information, long videos requires, videos requires extracting

Comments: Accepted to ICME 2026

Abstract:Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
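
The coarse-to-fine narrowing described above can be sketched as a three-stage selection over per-frame relevance scores. This is only a toy stand-in: in the paper the decisions are made by an MLLM agent, whereas here `frame_scores`, the window lengths, and the averaging rule are all assumptions for illustration.

```python
def condense(frame_scores, segment_len=8, snippet_len=2, keep_snippets=2):
    """Coarse-to-fine keyframe selection over per-frame relevance scores.

    frame_scores: hypothetical query-relevance score for each frame.
    Returns the indices of the selected keyframes in temporal order.
    """
    n = len(frame_scores)

    def mean_score(indices):
        return sum(frame_scores[i] for i in indices) / len(indices)

    # Stage 1: pick the single most relevant segment.
    segments = [list(range(s, min(s + segment_len, n)))
                for s in range(0, n, segment_len)]
    best_segment = max(segments, key=mean_score)

    # Stage 2: keep the highest-scoring snippets inside that segment.
    snippets = [best_segment[s:s + snippet_len]
                for s in range(0, len(best_segment), snippet_len)]
    top_snippets = sorted(snippets, key=mean_score, reverse=True)[:keep_snippets]

    # Stage 3: keep the best single frame of each retained snippet.
    keyframes = [max(snip, key=lambda i: frame_scores[i]) for snip in top_snippets]
    return sorted(keyframes)

scores = [0.1] * 8 + [0.2, 0.9, 0.1, 0.1, 0.3, 0.8, 0.1, 0.1]
keyframes = condense(scores)
```

With these toy scores, 16 frames are reduced to two keyframes, which is the point of the design: each stage only inspects what the previous stage kept, so the expensive model sees few frames.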

46. 【2604.02883】Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision

Link: https://arxiv.org/abs/2604.02883

Authors: Zhenxiao Liang, Qixing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pose-dependent temporal flicker, Editing animatable human, animatable human avatars, human avatars typically, avatars typically relies

Comments: none

Abstract:Editing animatable human avatars typically relies on sparse supervision, often a few edited keyframes, yet naively fitting a reconstructed avatar to these edits frequently causes identity leakage and pose-dependent temporal flicker. We argue that these failures are best understood as an ill-conditioned inversion: the available edited constraints do not sufficiently determine the latent directions responsible for the intended edit. We propose a conditioning-guided edited reconstruction framework that performs editing as a constrained inversion in a structured avatar latent space, restricting updates to a low-dimensional, part-specific edit subspace to prevent unintended identity changes. Crucially, we design the editing constraints during inversion by optimizing a conditioning objective derived from a local linearization of the full decoding-and-rendering pipeline, yielding an edit-subspace information matrix whose spectrum predicts stability and drives frame reweighting / keyframe activation. The resulting method operates on small subspace matrices and can be implemented efficiently (e.g., via Hessian-vector products), and improves stability under limited edited supervision.

47. 【2604.02880】InstructTable: Improving Table Structure Recognition Through Instructions

Link: https://arxiv.org/abs/2604.02880

Authors: Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai zhou, Pengfei Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: holds widespread practical, widespread practical importance, encounters significant challenges, layouts involving merged, processing complex layouts

Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - FINDINGS Track (CVPRF)

Abstract:Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.

48. 【2604.02877】Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework

Link: https://arxiv.org/abs/2604.02877

Authors: Yu Zhu, Kang Li, Zheng Li, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: surgical video scene, recent studies incrementally, studies incrementally update, backward knowledge transfer, video scene parsing

Comments: Accepted by CVPR 2026

Abstract:To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update the model so that it progressively learns to segment an increasing number of surgical instruments over time. However, prior works have consistently overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills of regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes and instrument-distinct prompt partitions as leaf nodes, to expose the reusable historical knowledge for new classes to simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5% and 11% improvements over the competing methods on two public benchmarks, respectively.

49. 【2604.02871】SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

Link: https://arxiv.org/abs/2604.02871

Authors: Tomoyasu Nanaumi, Yukino Tsuzuki, Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frozen foundation model, unseen target categories, foundation model features, target-domain adaptation, labeled auxiliary dataset

Comments: 14 pages, 6 figures, 9 tables

Abstract:We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.
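
To make the idea of generating guide vectors through a dictionary more concrete, the sketch below decodes two sparse coefficient vectors into a normal guide and an anomaly guide, then scores a patch feature by comparing cosine similarities. The tiny identity dictionary, the function names, and the difference-of-cosines scoring rule are assumptions for illustration; the abstract does not state how SPG turns guides into anomaly scores.

```python
import math

def decode_guide(coeffs, dictionary):
    """Combine dictionary atoms with sparse coefficients: guide = sum_k c_k * d_k."""
    dim = len(dictionary[0])
    guide = [0.0] * dim
    for c, atom in zip(coeffs, dictionary):
        if c != 0.0:                      # sparse: most coefficients are zero
            for i in range(dim):
                guide[i] += c * atom[i]
    return guide

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def anomaly_score(patch, normal_guide, anomaly_guide):
    """Assumed scoring rule: positive when the patch aligns with the anomaly guide."""
    return cosine(patch, anomaly_guide) - cosine(patch, normal_guide)

# Toy 3-atom dictionary; a trained SAE decoder would supply the real atoms.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
normal_guide = decode_guide([1.0, 0.0, 0.0], dictionary)   # uses atom 0 only
anomaly_guide = decode_guide([0.0, 0.0, 2.0], dictionary)  # uses atom 2 only
score_defect = anomaly_score([0.0, 0.1, 1.0], normal_guide, anomaly_guide)
score_clean = anomaly_score([1.0, 0.1, 0.0], normal_guide, anomaly_guide)
```

Because each guide is built from only a few nonzero coefficients, one can read off which atoms drive a decision, which matches the interpretability point made at the end of the abstract.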

50. 【2604.02870】Token Warping Helps MLLMs Look from Nearby Viewpoints

Link: https://arxiv.org/abs/2604.02870

Authors: Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, multimodal large language, language models, multimodal large, large language

Comments: CVPR 2026, Project Page: [this https URL](https://token-warping-mllm.github.io)

Abstract:Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
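
Backward warping, as described above, iterates over the target grid and looks up one source token per cell. The sketch below shows that lookup with nearest-neighbor retrieval on a toy token grid. The `target_to_source` callable is hypothetical: in practice the correspondence would be derived from depth and relative camera pose, which are not modeled here, and the padding token is likewise an assumption.

```python
def backward_warp_tokens(src_tokens, target_to_source, fill=None):
    """Build a target-view token grid by looking up source-view tokens.

    src_tokens: 2D list (rows x cols) of tokens from the source view.
    target_to_source: hypothetical callable mapping a target cell (r, c)
        to fractional source coordinates (row, col).
    fill: token used when the lookup falls outside the source grid.
    """
    rows, cols = len(src_tokens), len(src_tokens[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            sr, sc = target_to_source(r, c)
            sr, sc = round(sr), round(sc)          # nearest-neighbor retrieval
            inside = 0 <= sr < rows and 0 <= sc < cols
            row.append(src_tokens[sr][sc] if inside else fill)
        out.append(row)
    return out

src = [["a", "b", "c"],
       ["d", "e", "f"]]
# Toy mapping: each target cell reads the source cell one column to its right.
warped = backward_warp_tokens(src, lambda r, c: (r, c + 1.0), fill="<pad>")
```

Since the loop runs over target cells, every cell receives exactly one token and no holes appear; a forward warp that scatters source tokens into the target grid can leave cells empty or doubly assigned.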

51. 【2604.02867】HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

Link: https://arxiv.org/abs/2604.02867

Authors: Leyang Jin, Yujian Zheng, Bingkui Tong, Yuda Qiu, Zhenyu Xie, Hao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing strand-level, highly challenging, image is highly, preserving consistent, consistent and realistic

Comments: 17 pages, 6 figures

Abstract:Reconstructing strand-level 3D hair from a single-view image is highly challenging, especially when preserving consistent and realistic attributes in unseen regions. Existing methods rely on limited frontal-view cues and small-scale/style-restricted synthetic data, often failing to produce satisfactory results in invisible regions. In this work, we propose a novel framework that leverages the strong 3D priors of video generation models to transform single-view hair reconstruction into a calibrated multi-view reconstruction task. To balance reconstruction quality and efficiency for the reformulated multi-view task, we further introduce a neural orientation extractor trained on sparse real-image annotations for better full-view orientation estimation. In addition, we design a two-stage strand-growing algorithm based on a hybrid implicit field to synthesize the 3D strand curves with fine-grained details at a relatively fast speed. Extensive experiments demonstrate that our method achieves state-of-the-art performance on single-view 3D hair strand reconstruction on a diverse range of hair portraits in both visible and invisible regions.

52. 【2604.02860】A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Link: https://arxiv.org/abs/2604.02860

Authors: Allen He, Qi Liu, Kun Liu, Xinchen Liu, Wu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Temporal sentence grounding, temporal segment, aims to localize, TSGV, segment that semantically

Comments: Accepted as CVPR 2026 Workshop PVUW

Abstract:Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.

53. 【2604.02847】HiDiGen: Hierarchical Diffusion for B-Rep Generation with Explicit Topological Constraints

Link: https://arxiv.org/abs/2604.02847

Authors: Shurui Liu, Weide Chen, Ancong Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Boundary representation, encoding both geometric, geometric primitives, valid B-rep structures, B-rep structures remains

Comments: none

Abstract:Boundary representation (B-rep) is the standard 3D modeling format in CAD systems, encoding both geometric primitives and topological connectivity. Despite its prevalence, deep generative modeling of valid B-rep structures remains challenging due to the intricate interplay between discrete topology and continuous geometry. In this paper, we propose HiDiGen, a hierarchical generation framework that decouples geometry modeling into two stages, each guided by explicitly modeled topological constraints. Specifically, our approach first establishes face-edge incidence relations to define a coherent topological scaffold, upon which face proxies and initial edge curves are generated. Subsequently, multiple Transformer-based diffusion modules are employed to refine the geometry by generating precise face surfaces and vertex positions, with edge-vertex adjacencies dynamically established and enforced to preserve structural consistency. This progressive geometry hierarchy enables the generation of more novel and diverse shapes, while two-stage topological modeling ensures high validity. Experimental results show that HiDiGen achieves strong performance, generating novel, diverse, and topologically sound CAD models.

54. 【2604.02846】Adaptive Local Frequency Filtering for Fourier-Encoded Implicit Neural Representations

Link: https://arxiv.org/abs/2604.02846

Authors: Ligen Shi, Jun Qiu, Yuhang Zheng, Chang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: shown strong capability, modeling continuous signals, Fourier-encoded implicit neural, Fourier-encoded INRs, discrete samples

Comments: 12 pages, 8 figures

Abstract:Fourier-encoded implicit neural representations (INRs) have shown strong capability in modeling continuous signals from discrete samples. However, conventional Fourier feature mappings use a fixed set of frequencies over the entire spatial domain, making them poorly suited to signals with spatially varying local spectra and often leading to slow convergence of high-frequency details. To address this issue, we propose an adaptive local frequency filtering method for Fourier-encoded INRs. The proposed method introduces a spatially varying parameter $\alpha(\mathbf{x})$ to modulate encoded Fourier components, enabling a smooth transition among low-pass, band-pass, and high-pass behaviors at different spatial locations. We further analyze the effect of the proposed filter from the neural tangent kernel (NTK) perspective and provide an NTK-inspired interpretation of how it reshapes the effective kernel spectrum. Experiments on 2D image fitting, 3D shape representation, and sparse data reconstruction demonstrate that the proposed method consistently improves reconstruction quality and leads to faster optimization compared with fixed-frequency baselines. In addition, the learned $\alpha(\mathbf{x})$ provides an intuitive visualization of spatially varying frequency preferences, which helps explain the behavior of the model on non-stationary signals. These results indicate that adaptive local frequency modulation is a practical enhancement for Fourier-encoded INRs.
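
A minimal sketch of how a per-location parameter could reweight Fourier bands is given below. It uses octave-spaced frequencies and a Gaussian window over the band index, centered by `alpha`. Both choices, along with `sigma` and the function names, are assumptions for illustration; the abstract does not give the paper's actual filter, and in a real model $\alpha(\mathbf{x})$ would be predicted per location rather than passed in as a constant.

```python
import math

def band_weights(alpha, num_freqs=4, sigma=0.25):
    """Weight per frequency band for alpha in [0, 1] (requires num_freqs >= 2).

    alpha near 0 favors low bands, near 1 favors high bands, and values in
    between act band-pass. The Gaussian window is an illustrative assumption.
    """
    return [math.exp(-((k / (num_freqs - 1) - alpha) ** 2) / (2.0 * sigma ** 2))
            for k in range(num_freqs)]

def filtered_fourier_features(x, alpha, num_freqs=4, sigma=0.25):
    """Fourier features of a scalar x with each band scaled by its weight."""
    feats = []
    for k, w in enumerate(band_weights(alpha, num_freqs, sigma)):
        angle = 2.0 * math.pi * (2 ** k) * x   # octave-spaced frequency
        feats.extend([w * math.sin(angle), w * math.cos(angle)])
    return feats

low = band_weights(0.0)    # low-pass-like weighting
mid = band_weights(0.5)    # band-pass-like weighting
high = band_weights(1.0)   # high-pass-like weighting
features = filtered_fourier_features(0.3, alpha=0.5)
```

Sweeping `alpha` from 0 to 1 moves the emphasis smoothly from the lowest band to the highest, which is the qualitative behavior the abstract attributes to the learned parameter.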

55. 【2604.02845】Deformation-based In-Context Learning for Point Cloud Understanding

Link: https://arxiv.org/abs/2604.02845

Authors: Chengxing Lin, Jinhong Deng, Yinjie Lei, Wen Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cloud In-Context Learning, point cloud ICL, strong multitask capabilities, demonstrated strong multitask, Recent advances

Comments: Accepted by CVPR 2026. Code: [this https URL](https://github.com/linchengxing/DeformPIC)

Abstract:Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training-inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.
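For readers unfamiliar with the metric quoted above, here is a minimal sketch of a symmetric Chamfer Distance between two point sets. The abstract does not say which variant the authors report; the squared-distance, mean-reduced, two-direction-sum form below is one common convention and is an assumption here.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbour distance in each direction, then summed.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

The dense (N, M) distance matrix keeps the sketch short; large clouds would normally use a spatial index instead.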

56. 【2604.02836】Factorized Multi-Resolution HashGrid for Efficient Neural Radiance Fields: Execution on Edge-Devices

Link: https://arxiv.org/abs/2604.02836

Authors: Kim Jun-Seong, Mingyu Kim, GeonU Kim, Tae-Hyun Oh, Jin-Hwa Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural radiance fields, radiance fields, neural radiance, on-device neural radiance, large application fields

Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Abstract:We introduce Fact-Hash, a novel parameter-encoding method for training on-device neural radiance fields. Neural Radiance Fields (NeRF) have proven pivotal in 3D representations, but their applications are limited due to large computational resources. On-device training can open large application fields, providing strength in communication limitations, privacy concerns, and fast adaptation to a frequently changing scene. However, challenges such as limited resources (GPU memory, storage, and power) impede their deployment. To handle this, we introduce Fact-Hash, a novel parameter-encoding merging Tensor Factorization and Hash-encoding techniques. This integration offers two benefits: the use of rich high-resolution features and the few-shot robustness. In Fact-Hash, we project 3D coordinates into multiple lower-dimensional forms (2D or 1D) before applying the hash function and then aggregate them into a single feature. Comparative evaluations against state-of-the-art methods demonstrate Fact-Hash's superior memory efficiency, preserving quality and rendering speed. Fact-Hash saves memory usage by over one-third while maintaining the PSNR values compared to previous encoding methods. The on-device experiment validates the superiority of Fact-Hash compared to alternative positional encoding methods in computational efficiency and energy consumption. These findings highlight Fact-Hash as a promising solution to improve feature grid representation, address memory constraints, and improve quality in various applications. Project page: this https URL
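The "project to lower dimensions, hash, then aggregate" step can be pictured with the sketch below. It is not the authors' implementation: it assumes three axis-aligned 2D projections, a single resolution, nearest-vertex lookup without interpolation, an XOR-of-primes spatial hash of the kind used in multi-resolution hash grids, and summation as the aggregation. The names `hash2d` and `encode` are hypothetical.

```python
import numpy as np

# Multipliers for an XOR-based spatial hash (assumed; borrowed from common hash-grid encodings).
PRIMES = np.array([1, 2654435761], dtype=np.uint64)

def hash2d(ij: np.ndarray, table_size: int) -> np.ndarray:
    """Map integer 2D grid coordinates (N, 2) to hash-table slots in [0, table_size)."""
    ij = ij.astype(np.uint64)
    h = (ij[:, 0] * PRIMES[0]) ^ (ij[:, 1] * PRIMES[1])
    return (h % np.uint64(table_size)).astype(np.int64)

def encode(xyz: np.ndarray, tables: list, resolution: int) -> np.ndarray:
    """xyz: (N, 3) in [0, 1); tables: three (T, F) feature tables, one per projection plane."""
    grid = np.floor(xyz * resolution).astype(np.int64)  # lower grid vertex of each point
    planes = [(0, 1), (0, 2), (1, 2)]                   # xy, xz, yz projections
    feats = []
    for (a, b), table in zip(planes, tables):
        idx = hash2d(grid[:, [a, b]], table.shape[0])
        feats.append(table[idx])                        # (N, F) lookup per plane
    return np.sum(feats, axis=0)                        # aggregate into one (N, F) feature
```

Because each table indexes a 2D plane rather than the full 3D grid, far fewer distinct vertices compete for the same number of slots, which is one intuition for the memory saving the abstract describes.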

57. 【2604.02829】STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

Link: https://arxiv.org/abs/2604.02829

Authors: Hao Ren, Zetong Bi, Yiming Zeng, Zhaoliang Wan, Lu Qi, Hui Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: requires the robot, robot to reach, first-person visual observations, Visual navigation requires, Visual

Comments: CVPR 2026

Abstract:Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. Code is available at this https URL.

58. 【2604.02828】NavCrafter: Exploring 3D Scenes from a Single Image

Link: https://arxiv.org/abs/2604.02828

Authors: Hongbo Duan, Peiyu Zhuang, Yi Liu, Zhengyang Zhang, Yuxin Zhang, Pengting Luo, Fangming Liu, Xueqian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Creating flexible, vital when direct, data acquisition, costly or impractical, single image

Comments: 8 pages; accepted by ICRA 2026

Abstract:Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.

59. 【2604.02817】MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Link: https://arxiv.org/abs/2604.02817

Authors: Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu, Jin Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visually stunning content, generating visually stunning, yield physically inconsistent, physically inconsistent results, inconsistent results due

Comments: Project Page: [this https URL](https://shubolin028.github.io/MMPhysVideo-Page)

Abstract:Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.

60. 【2604.02816】QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.02816

Authors: Xinhao Wang, Zhonyu Xia, Zhiwei Lin, Zhe Li, Yongtao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Multimodal Large, Language Models, Large Language

Comments: 12 pages

Abstract:Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
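One ingredient named in this abstract, a simulated group-wise quantization error per token, can be sketched as follows. The abstract gives no formula, so symmetric round-to-nearest quantization, the group size of 32, the mean-squared-error reduction, and the function name are assumptions. How the paper combines this error with outlier intensity and semantic relevance is not specified and is not attempted here.

```python
import numpy as np

def groupwise_quant_error(x: np.ndarray, bits: int = 4, group: int = 32) -> np.ndarray:
    """Per-token MSE after simulated symmetric group-wise quantization. x: (N, D) activations."""
    n, d = x.shape
    assert d % group == 0, "feature dimension must be divisible by the group size"
    g = x.reshape(n, d // group, group)
    qmax = 2 ** (bits - 1) - 1                         # e.g. 7 for signed 4-bit
    scale = np.abs(g).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid division by zero for all-zero groups
    q = np.clip(np.round(g / scale), -qmax - 1, qmax) * scale  # quantize, then dequantize
    return ((g - q) ** 2).mean(axis=(1, 2))            # one error value per token
```

A token whose groups contain a single large outlier gets a coarse scale, so its remaining values round poorly and its error rises.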

61. 【2604.02808】CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification

Link: https://arxiv.org/abs/2604.02808

Authors: Haoxuan Xu, Hanzi Wang, Guanglin Niu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Person Re-Identification, faces severe challenges, long-term surveillance scenario, surveillance scenario, faces severe

Comments: none

Abstract:Person Re-Identification (ReID) faces severe challenges from modality discrepancy and clothing variation in long-term surveillance scenarios. While existing studies have made significant progress in either Visible-Infrared ReID (VI-ReID) or Clothing-Change ReID (CC-ReID), real-world surveillance systems often face both challenges simultaneously. To address this overlooked yet realistic problem, we define a new task, termed Cross-Modality Clothing-Change Re-Identification (CMCC-ReID), which targets pedestrian matching across variations in both modality and clothing. To advance research in this direction, we construct a new benchmark SYSU-CMCC, where each identity is captured in both visible and infrared domains with distinct outfits, reflecting the dual heterogeneity of long-term surveillance. To tackle CMCC-ReID, we propose a Progressive Identity Alignment Network (PIA) that progressively mitigates the issues of clothing variation and modality discrepancy. Specifically, a Dual-Branch Disentangling Learning (DBDL) module separates identity-related cues from clothing-related factors to achieve clothing-agnostic representation, and a Bi-Directional Prototype Learning (BPL) module performs intra-modality and inter-modality contrast in the embedding space to bridge the modality gap while further suppressing clothing interference. Extensive experiments on the SYSU-CMCC dataset demonstrate that PIA establishes a strong baseline for this new task and significantly outperforms existing methods.

62. 【2604.02804】PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis

Link: https://arxiv.org/abs/2604.02804

Authors: Dexiang Li, Zhenning Che, Haijun Zhang, Dongliang Zhou, Zhao Zhang, Yahong Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: Pavement condition assessment, condition assessment, assessment is essential, essential for road, road safety

Comments: none

Abstract:Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress. However, most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. In real-world applications, pavement inspection requires more than visual recognition. It also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited. They focus on unimodal perception. They lack support for multi-turn interaction and fact-grounded reasoning. They also do not connect perception with vision-language analysis. To address these limitations, we introduce PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis on real-world highway inspection images. PaveBench supports four core tasks: classification, object detection, semantic segmentation, and vision-language question answering. It provides unified task definitions and evaluation protocols. On the visual side, PaveBench provides large-scale annotations and includes a curated hard-distractor subset for robustness evaluation. It contains a large collection of real-world pavement images. On the multimodal side, we introduce PaveVQA, a real-image question answering (QA) dataset that supports single-turn, multi-turn, and expert-corrected interactions. It covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state-of-the-art methods and provide a detailed analysis. We also present a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models. The dataset is available at: this https URL.

63. 【2604.02799】UNICA: A Unified Neural Framework for Controllable 3D Avatars

Link: https://arxiv.org/abs/2604.02799

Authors: Jiahe Zhu, Xinyao Wang, Yiyu Zhuang, Yanwen Wang, Jing Tian, Yao Yao, Hao Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: found widespread applications, found widespread, widespread applications, neural Controllable Avatar, physical simulation

Comments: Open-source code: [this https URL](https://github.com/zjh21/UNICA)

Abstract:Controllable 3D human avatars have found widespread applications in 3D games, the metaverse, and AR/VR scenarios. The conventional approach to creating such a 3D avatar requires a lengthy, intricate pipeline encompassing appearance modeling, motion planning, rigging, and physical simulation. In this paper, we introduce UNICA (UNIfied neural Controllable Avatar), a skeleton-free generative model that unifies all avatar control components into a single neural framework. Given keyboard inputs akin to video game controls, UNICA generates the next frame of a 3D avatar's geometry through an action-conditioned diffusion model operating on 2D position maps. A point transformer then maps the resulting geometry to 3D Gaussian Splatting for high-fidelity free-view rendering. Our approach naturally captures hair and loose clothing dynamics without manually designed physical simulation, and supports extra-long autoregressive generation. To the best of our knowledge, UNICA is the first model to unify the workflow of "motion planning, rigging, physical simulation, and rendering". Code is released at this https URL.

64. 【2604.02787】LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers

Link: https://arxiv.org/abs/2604.02787

Authors: Shreshth Saini, Hakan Gedik, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Standard Dynamic Range, High Dynamic Range, Dynamic Range, Standard Dynamic, High Dynamic

Comments: none

Abstract:The rapid adoption of HDR-capable devices has created a pressing need to convert 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR). Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaFlux, the first physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction, built by adapting a large pretrained DiT. LumaFlux introduces (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning from vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. Finally, a lightweight Rational-Quadratic Spline decoder reconstructs smooth, interpretable tone fields for highlight and exposure expansion, enhancing the output of the VAE decoder to generate HDR. To enable robust HDR learning, we curate the first large-scale SDR-HDR training corpus. For fair and reproducible comparison, we further establish a new evaluation benchmark, comprising HDR references and corresponding expert-graded SDR versions. Across benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving superior luminance reconstruction and perceptual color fidelity with minimal additional parameters.

65. 【2604.02785】CANDLE: Illumination-Invariant Semantic Priors for Color Ambient Lighting Normalization

Link: https://arxiv.org/abs/2604.02785

Authors: Rong-Lin Jian, Ting-Yao Chen, Yu-Fan Lin, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe chromatic shifts, highlight saturation, Color Ambient Normalization, material-dependent reflectance, Color ambient lighting

Comments: CVPRW 2026 camera-ready; NTIRE 2026 Ambient Lighting Normalization (3rd in the Color Lighting track, 2nd in fidelity in the White Lighting track)

Abstract:Color ambient lighting normalization under multi-colored illumination is challenging due to severe chromatic shifts, highlight saturation, and material-dependent reflectance. Existing geometric and low-level priors are insufficient for recovering object-intrinsic color when illumination-induced chromatic bias dominates. We observe that DINOv3's self-supervised features remain highly consistent between colored-light inputs and ambient-lit ground truth, motivating their use as illumination-robust semantic priors. We propose CANDLE (Color Ambient Normalization with DINO Layer Enhancement), which introduces DINO Omni-layer Guidance (D.O.G.) to adaptively inject multi-layer DINOv3 features into successive encoder stages, and a color-frequency refinement design (BFACG + SFFB) to suppress decoder-side chromatic collapse and detail contamination. Experiments on CL3AN show a +1.22 dB PSNR gain over the strongest prior method. CANDLE achieves 3rd place on the NTIRE 2026 ALN Color Lighting Challenge and 2nd place in fidelity on the White Lighting track with the lowest FID, confirming strong generalization across both chromatic and luminance-dominant illumination conditions. Code is available at this https URL.

66. 【2604.02784】EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Link: https://arxiv.org/abs/2604.02784

Authors: Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: input image, remain vulnerable, factually incorrect, incorrect or ungrounded, Vision-Language Models

Comments: none

Abstract:Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.

67. 【2604.02780】A Unified Perspective on Adversarial Membership Manipulation in Vision Models

Link: https://arxiv.org/abs/2604.02780

Authors: Ruize Gao, Kaiwen Zhou, Yongqiang Chen, Feng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: specific data point, evaluating privacy leakage, model training set, aim to determine, training set

Comments: Accepted by CVPR 2026

Abstract:Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model's training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the "member" region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature - a characteristic gradient-norm collapse trajectory - that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.

68. 【2604.02773】Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark

Link: https://arxiv.org/abs/2604.02773

Authors: Haoran Zhu, Wen Yang, Guangyou Yang, Chang Xu, Ruixiang Zhang, Fang Xu, Haijian Zhang, Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Small object detection, ambiguous object boundaries, Small object, remains challenging due, extremely limited pixels

Comments: none

Abstract:Small object detection (SOD) remains challenging due to extremely limited pixels and ambiguous object boundaries. These characteristics lead to challenging annotation, limited availability of large-scale high-quality datasets, and inherently weak semantic representations for small objects. In this work, we first address the data limitation by introducing TinySet-9M, the first large-scale, multi-domain dataset for small object detection. Beyond filling the gap in large-scale datasets, we establish a benchmark to evaluate the effectiveness of existing label-efficient detection methods for small objects. Our evaluation reveals that weak visual cues further exacerbate the performance degradation of label-efficient methods in small object detection, highlighting a critical challenge in label-efficient SOD. Secondly, to tackle the limitation of insufficient semantic representation, we move beyond training-time feature enhancement and propose a new paradigm termed Point-Prompt Small Object Detection (P2SOD). This paradigm introduces sparse point prompts at inference time as an efficient information bridge for category-level localization, enabling semantic augmentation. Building upon the P2SOD paradigm and the large-scale TinySet-9M dataset, we further develop DEAL (DEtect Any smalL object), a scalable and transferable point-prompted detection framework that learns robust, prompt-conditioned representations from large-scale data. With only a single click at inference time, DEAL achieves a 31.4% relative improvement over fully supervised baselines under strict localization metrics (e.g., AP75) on TinySet-9M, while generalizing effectively to unseen categories and unseen datasets. Our project is available at this https URL.

69. 【2604.02764】InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging

Link: https://arxiv.org/abs/2604.02764

Authors: Leyang Jin, Zirong Jin, Zisheng Ye, Haokai Pang, Xiaoguang Han, Yujian Zheng, Hao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human digitization research, Recovering sewing patterns, sewing patterns, parametric sewing patterns, digitization research

Comments: 13 pages, 13 figures

Abstract:Recovering sewing patterns from draped 3D garments is a challenging problem in human digitization research. In contrast to the well-studied forward process of draping designed sewing patterns using mature physical simulation engines, the inverse process of recovering parametric 2D patterns from deformed garment geometry remains fundamentally ill-posed for existing methods. We propose a two-stage framework that centers on a structured intermediate representation, BoxMesh, which serves as the key to bridging the gap between 3D garment geometry and parametric sewing patterns. BoxMesh encodes both garment-level geometry and panel-level structure in 3D, while explicitly disentangling intrinsic panel geometry and stitching topology from draping-induced deformations. This representation imposes a physically grounded structure on the problem, significantly reducing ambiguity. In Stage I, a geometry-driven autoregressive model infers BoxMesh from the input 3D garment. In Stage II, a semantics-aware autoregressive model parses BoxMesh into parametric sewing patterns. We adopt autoregressive modeling to naturally handle the variable-length and structured nature of panel configurations and stitching relationships. This decomposition separates geometric inversion from structured pattern inference, leading to more accurate and robust recovery. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the GarmentCodeData benchmark and generalizes effectively to real-world scans and single-view images.

70. 【2604.02753】DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2604.02753

Authors: Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-vocabulary Object Detection, Open-vocabulary Object, existing approaches remain, approaches remain limited, recognize objects

Comments: Accepted at ICLR 2026

Abstract:Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.

71. 【2604.02752】Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

Link: https://arxiv.org/abs/2604.02752

Authors: Jinfan Liu, Wuze Zhang, Zhangli Hu, Zhehan Zhao, Ye Chen, Bingbing Ni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optimizers lack structural, lack structural awareness, produce unstructured layouts, local minima due, differentiable optimizers lack

Comments: none

Abstract:In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30-50%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30-40% compared to existing differentiable vectorization methods.

72. 【2604.02748】Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

Link: https://arxiv.org/abs/2604.02748

Authors: Jonghun Kim, Sinyoung Ra, Hyunjin Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable capabilities, remarkable capabilities, capabilities in linguistic, increasingly adept, adept at vision-language

Comments: Accepted at ICPR 2026

Abstract:LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text descriptions to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.

73. 【2604.02736】THOM: Generating Physically Plausible Hand-Object Meshes From Text

Link: https://arxiv.org/abs/2604.02736

Authors: Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dexterous robotic grasping, high visual fidelity, requiring both high, content generation, crucial for dexterous

Comments: accepted to CVPR Findings 2026

Abstract:The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. Nevertheless, the ill-posed problem of mesh extraction from text-generated Gaussians, and physics-based optimization on the erroneous meshes pose challenges. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions by VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.

74. 【2604.02719】MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications

Link: https://arxiv.org/abs/2604.02719

Authors: Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Mars remote sensing, remote sensing, Equal Validation Loss, Mars remote, Validation Loss

Comments: Accepted at CVPR 2026 (Main Track)

Abstract:We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of $\sim 12$ million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance compared to ImageNet pre-trained, earth observation foundation model, sensor-specific pre-training, and fully-supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: this https URL.

75. 【2604.02714】ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2604.02714

Authors: Zihao Sheng, Xin Ye, Jingru Luo, Sikai Chen, Liu Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown promising results, autonomous driving models, driving models based, learning driving policies, architectures have shown

Comments: The code and demo will be publicly available at [this https URL](https://zihaosheng.github.io/ExploreVLA/)

Abstract:End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory's novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at this https URL.

76. 【2604.02710】V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

Link: https://arxiv.org/abs/2604.02710

Authors: Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang, Rui Gan, Jiaxi Liu, Yan Zhao, Sikai Chen, Bin Ran

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, large language models, systematically assess model, cooperative driving conditions

Abstract:Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: this https URL.

77. 【2604.02707】A Rapid Instrument Exchange System for Humanoid Robots in Minimally Invasive Surgery

Link: https://arxiv.org/abs/2604.02707

Authors: Bingcong Zhang, Yihang Lyv, Lianbo Ma, Yushi He, Pengfei Wei, Xingchi Liu, Jinhua Li, Jianchang Zhao, Lizhi Pan

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Keywords: minimally invasive surgery, demonstrated immense potential, Humanoid robot technologies, invasive surgery, humanoid robots

Abstract:Humanoid robot technologies have demonstrated immense potential for minimally invasive surgery (MIS). Unlike dedicated multi-arm surgical platforms, the inherent dual-arm configuration of humanoid robots necessitates an efficient instrument exchange capability to perform complex procedures, mimicking the natural workflow where surgeons manually switch instruments. To address this, this paper proposes an immersive teleoperated rapid instrument exchange system. The system utilizes a low-latency mechanism based on single-axis compliant docking and environmental constraint release. Integrated with real-time first-person view (FPV) perception via a head-mounted display (HMD), this framework significantly reduces operational complexity and cognitive load during the docking process. Comparative evaluations between experts and novices demonstrate high operational robustness and a rapidly converging learning curve; novice performance in instrument attachment and detachment improved substantially after brief training. While long-distance spatial alignment still presents challenges in time cost and collaborative stability, this study successfully validates the technical feasibility of humanoid robots executing stable instrument exchanges within constrained clinical environments.

78. 【2604.02696】VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping

Link: https://arxiv.org/abs/2604.02696

Authors: Yuhan Zhu, Yanyu Zhang, Jie Xu, Wei Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Gaussian Splatting SLAM, Bayesian Gaussian Splatting, variants typically rely, shown promising results, deterministic pose optimization

Abstract:3D Gaussian Splatting (3DGS) has shown promising results for 3D scene modeling using mixtures of Gaussians, yet its existing simultaneous localization and mapping (SLAM) variants typically rely on direct, deterministic pose optimization against the splat map, making them sensitive to initialization and susceptible to catastrophic forgetting as the map evolves. We propose Variational Bayesian Gaussian Splatting SLAM (VBGS-SLAM), a novel framework that couples the splat map refinement and camera pose tracking in a generative probabilistic form. By leveraging conjugate properties of multivariate Gaussians and variational inference, our method admits efficient closed-form updates and explicitly maintains posterior uncertainty over both poses and scene parameters. This uncertainty-aware method mitigates drift and enhances robustness in challenging conditions, while preserving the efficiency and rendering quality of existing 3DGS. Our experiments demonstrate superior tracking performance and robustness in long sequence prediction, alongside efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes.

79. 【2604.02695】XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis

Link: https://arxiv.org/abs/2604.02695

Authors: Shawn Young, Lijian Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Chest X-ray, complex clinical task, intelligence for automation, fundamental yet complex, task that increasingly

Comments: 14 pages

Abstract:Chest X-ray (CXR) interpretation is a fundamental yet complex clinical task that increasingly relies on artificial intelligence for automation. However, traditional monolithic models often lack the nuanced reasoning required for trustworthy diagnosis, frequently leading to logical inconsistencies and diagnostic hallucinations. While multi-agent systems offer a potential solution by simulating collaborative consultations, existing frameworks remain susceptible to consensus-based errors when instantiated by a single underlying model. This paper introduces XrayClaw, a novel framework that operationalizes multi-agent alignment through a sophisticated cooperative-competitive architecture. XrayClaw integrates four specialized cooperative agents to simulate a systematic clinical workflow, alongside a competitive agent that serves as an independent auditor. To reconcile these distinct diagnostic pathways, we propose Competitive Preference Optimization, a learning objective that penalizes illogical reasoning by enforcing mutual verification between analytical and holistic interpretations. Extensive empirical evaluations on the MS-CXR-T, MIMIC-CXR, and CheXbench benchmarks demonstrate that XrayClaw achieves state-of-the-art performance in diagnostic accuracy, clinical reasoning fidelity, and zero-shot domain generalization. Our results indicate that XrayClaw effectively mitigates cumulative hallucinations and enhances the overall reliability of automated CXR diagnosis, establishing a new paradigm for trustworthy medical imaging analysis.

80. 【2604.02694】DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

Link: https://arxiv.org/abs/2604.02694

Authors: Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Weibin Yao, Joey Tianyi Zhou, Jianshu Li, Yin Yan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enabled increasingly realistic, posing major challenges, increasingly realistic text-centric, posing major, document safety

Comments: 10 pages, 4 figures, 5 tables. Preprint

Abstract:The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure, spatial evidence, and authenticity prediction. Complementing the framework, we construct RealText-V1, a multilingual dataset of document-like text images with pixel-level manipulation masks and expert-level textual explanations. Extensive experiments show DocShield significantly outperforms existing methods, improving macro-average F1 by 41.4% over specialized frameworks and 23.4% over GPT-4o on T-IC13, with consistent gains on the challenging T-SROIE benchmark. Our dataset, model, and code will be publicly released.

81. 【2604.02692】Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing

Link: https://arxiv.org/abs/2604.02692

Authors: Fuyuan Liu, Dianyu Yu, He Ren, Nayu Liu, Xiaomian Kang, Delai Qiu, Fa Zhang, Genpeng Zhen, Shengping Liu, Jiaen Liang, Wei Huang, Yining Wang, Junnan Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate document parsing, robust content recognition, Accurate document, Document Layout Analysis, document parsing requires

Abstract:Accurate document parsing requires both robust content recognition and a stable parser interface. In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output. Instead, they operate on a retained and serialized set of layout instances. However, on dense pages with overlapping regions and ambiguous boundaries, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, leading to severe downstream parsing errors. To address this issue, we introduce a lightweight structural refinement stage between a DETR-style detector and the parser to stabilize the parser interface. Treating raw detector outputs as a compact hypothesis pool, the proposed module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence. From a shared refined structural state, it jointly determines instance retention, refines box localization, and predicts parser input order before handoff. We further introduce retention-oriented supervision and a difficulty-aware ordering objective to better align the retained instance set and its order with the final parser input, especially on structurally complex pages. Extensive experiments on public benchmarks show that our method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface also substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.

82. 【2604.02689】Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

Link: https://arxiv.org/abs/2604.02689

Authors: Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Multimodal Large, enabling fine-grained spatial, fine-grained spatial understanding

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: this https URL.

83. 【2604.02654】Drift-Resilient Temporal Priors for Visual Tracking

Link: https://arxiv.org/abs/2604.02654

Authors: Yuqing Huang, Liting Lin, Weijun Zhuang, Zhenyu He, Xin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: naively aggregating noisy, noisy historical predictions, aggregating noisy historical, existing multi-frame trackers, Temporal Reliability Calibrator

Comments: accepted by CVPR 2026

Abstract:Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures (OSTrack, ODTrack, and LoRAT) and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k.

84. 【2604.02639】Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles

Link: https://arxiv.org/abs/2604.02639

Authors: Weimin Liu, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Surround depth estimation, Surround depth, perception in autonomous, autonomous driving, cost-effective alternative

Abstract:Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotics platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose ArticuSurDepth, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from a vision foundation model. Specifically, we introduce a multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate camera height regularization with ground-plane awareness to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate our proposed method, an articulated vehicle experiment platform was established and a dataset was collected with it. Experimental results demonstrate state-of-the-art (SoTA) depth estimation performance on our self-collected dataset as well as on the DDAD, nuScenes, and KITTI benchmarks.

85. 【2604.02627】Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

Link: https://arxiv.org/abs/2604.02627

Authors: Hao Li, Liwei Zou, Wenping Yin, Gulsen Taskin, Naoto Yokoya, Danfeng Hong, Wufan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: severe natural disasters, changing climate, human society, society now faces, faces more frequent

Abstract:Living in a changing climate, human society now faces more frequent and severe natural disasters than ever before. As a consequence, rapid disaster response during the "Golden 72 Hours" of search and rescue becomes a vital humanitarian necessity and community concern. However, traditional disaster damage surveys routinely fail to generalize across distinct urban morphologies and new disaster events. Effective damage mapping typically requires exhaustive and time-consuming manual data annotation. To address this issue, we introduce Smart Transfer, a novel Geospatial Artificial Intelligence (GeoAI) framework, leveraging state-of-the-art vision Foundation Models (FMs) for rapid building damage mapping with post-earthquake Very High Resolution (VHR) imagery. Specifically, we design two novel model transfer strategies: first, Pixel-wise Clustering (PC), ensuring robust prototype-level global feature alignment; second, a Distance-Penalized Triplet (DPT), integrating patch-level spatial autocorrelation patterns by assigning stronger penalties to semantically inconsistent yet spatially adjacent patches. Extensive experiments and ablations from the recent 2023 Turkiye-Syria earthquake show promising performance in multiple cross-region transfer settings, namely Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC). Moreover, Smart Transfer provides a scalable, automated GeoAI solution to accelerate building damage mapping and support rapid disaster response, offering new opportunities to enhance disaster resilience in climate-vulnerable regions and communities. The data and code are publicly available at this https URL.

86. 【2604.02616】Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis

Link: https://arxiv.org/abs/2604.02616

Authors: Guangyu Sun, Wenhan Wu, Zhishuai Guo, Ziteng Wang, Pegah Khosravi, Chen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: objective clinical assessment, Automated recognition, children is essential, essential for early, early intervention

Comments: Accepted at the CVPR 2026 Workshop on Computer Vision for Children (CV4CHL)

Abstract:Automated recognition of autistic behaviors in children is essential for early intervention and objective clinical assessment. However, the development of robust models is severely hindered by strict privacy regulations (e.g., HIPAA) and the sensitive nature of pediatric data, which prevents the centralized aggregation of clinical datasets. Furthermore, individual clinical sites often suffer from data scarcity, making it difficult to learn generalized behavior patterns or tailor models to site-specific patient distributions. To address these challenges, we observe that Federated Learning (FL) can decouple model training from raw data access, enabling multi-site collaboration while maintaining strict data residency. In this paper, we present the first study exploring Federated Learning for pose-based child autism behavior recognition. Our framework employs a two-layer privacy protection mechanism: utilizing human skeletal abstraction to remove identifiable visual information from the raw RGB videos and FL to ensure sensitive pose data remains within the clinic. This approach leverages distributed clinical data to learn generalized representations while providing the flexibility for site-specific personalization. Experimental results on the MMASD benchmark demonstrate that our framework achieves high recognition accuracy, outperforming traditional federated baselines and providing a robust, privacy-first solution for multi-site clinical analysis.

87. 【2604.02603】Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals

Link: https://arxiv.org/abs/2604.02603

Authors: Kunzhe Song, Geo Jie Zhou, Xiaoming Liu, Huacheng Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robot navigation, critical for applications, autonomous driving, driving and robot, environmental perception

Comments: Accepted to CVPR 2026

Abstract:Robust 3D environmental perception is critical for applications such as autonomous driving and robot navigation. However, optical sensors such as cameras and LiDAR often fail under adverse conditions, including smoke, fog, and non-ideal lighting. Although specialized radar systems can operate in these environments, their reliance on bespoke hardware and licensed spectrum limits scalability and cost-effectiveness. This paper introduces Rascene, an integrated sensing and communication (ISAC) framework that leverages ubiquitous mmWave OFDM communication signals for 3D scene imaging. To overcome the sparse and multipath-ambiguous nature of individual radio frames, Rascene performs multi-frame, spatially adaptive fusion with confidence-weighted forward projection, enabling the recovery of geometric consensus across arbitrary poses. Experimental results demonstrate that our method reconstructs 3D scenes with high precision, offering a new pathway toward low-cost, scalable, and robust 3D perception.

88. 【2604.02593】Moondream Segmentation: From Words to Masks

Link: https://arxiv.org/abs/2604.02593

Authors: Ethan Reid

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: image segmentation extension, present Moondream Segmentation, referring image segmentation, Moondream Segmentation, vision-language model

Comments: Demo: [this https URL](https://moondream.ai/me/playground)

Abstract:We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).

89. 【2604.02586】TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction

Link: https://arxiv.org/abs/2604.02586

Authors: Daheng Yin, Isaac Ding, Yili Jin, Jianxin Shi, Jiangchuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Recent advancements, Gaussian Splatting, dynamic scene reconstruction, efficient and photorealistic, immersive media

Comments: 11 pages, 6 figures

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated its potential for efficient and photorealistic 3D reconstructions, which is crucial for diverse applications such as robotics and immersive media. However, current Gaussian-based methods for dynamic scene reconstruction struggle with large inter-frame displacements, leading to artifacts and temporal inconsistencies under fast object motions. To address this, we introduce TrackerSplat, a novel method that integrates advanced point tracking methods to enhance the robustness and scalability of 3DGS for dynamic scene reconstruction. TrackerSplat utilizes off-the-shelf point tracking models to extract pixel trajectories and triangulate per-view pixel trajectories onto 3D Gaussians to guide the relocation, rotation, and scaling of Gaussians before training. This strategy effectively handles large displacements between frames, dramatically reducing the fading and recoloring artifacts prevalent in prior methods. By accurately positioning Gaussians prior to gradient-based optimization, TrackerSplat overcomes the quality degradation associated with large frame gaps when processing multiple adjacent frames in parallel across multiple devices, thereby boosting reconstruction throughput while preserving rendering quality. Experiments on real-world datasets confirm the robustness of TrackerSplat in challenging scenarios with significant displacements, achieving superior throughput under parallel settings and maintaining visual quality compared to baselines. The code is available at this https URL.

90. 【2604.02583】FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder

Link: https://arxiv.org/abs/2604.02583

Authors: Wei Li, Yufan Ren, Hanqing Jiang, Jianhui Ding, Zhen Peng, Leman Feng, Yichun Shentu, Guoqiang Xu, Baigui Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-view, visual fusion framework, visual, multi-view visual, multi-view visual fusion

Comments: 9 pages, 6 figures, 2 tables

Abstract:We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which innovatively utilizes a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses inter-view complementary relationships and selectively emphasizes informative visual cues across multiple views to get a more robustly fused visual feature for better 3D model matching. Furthermore, FusionBERT proposes a normal-aware 3D model encoder that can further enhance the 3D geometric feature of an object model by jointly encoding point normals and 3D positions, enabling a more robust representation learning for textureless or color-degraded 3D models. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.

91. 【2604.02570】WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

Link: https://arxiv.org/abs/2604.02570

Authors: Haiyu Wang, Yutong Wang, Jack Jiang, Sai Qian Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Vision Language Models, Singular Value Decomposition, visual question answering, Vision Language, burden of Vision

Comments:

Click to view abstract

Abstract:Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during the SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce Weighted SVD (WSVD), which outperforms other approaches by achieving over $1.8\times$ decoding speedup while preserving accuracy. We open source our code at: this https URL

92. 【2604.02546】Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

Link: https://arxiv.org/abs/2604.02546

Authors: Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Language Image Pretraining, Contrastive Language Image, Contrastive Language, Image Pretraining, aligning with Contrastive

Comments: 24 pages

Click to view abstract

Abstract:Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometry and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate our state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. this https URL

93. 【2604.02543】Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

Link: https://arxiv.org/abs/2604.02543

Authors: Ji Young Byun, Young-Jin Park, Jean-Philippe Corbeil, Asma Ben Abacha

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: clinical decision support, decision support, accuracy is required, equally critical, increasingly deployed

Comments:

Click to view abstract

Abstract:As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B to 38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform the prompt-based strategy. Third, due to their (strict) monotonicity, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC at the same level. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions. Overall, our findings suggest post-hoc calibration as standard practice for medical VLM deployment over raw confidence estimates, and highlight the practical usefulness of hallucination signals to enable more reliable use of VLMs in medical VQA.
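The third finding above (a strictly monotone recalibration cannot change AUROC) can be checked with a small sketch. This is not the authors' code: the confidence values, the correctness labels, and the Platt parameters `a` and `b` are hypothetical, and in practice `a` and `b` would be fitted on held-out data rather than chosen by hand.

```python
import math

def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def platt(score, a, b):
    """Platt scaling: a logistic map of the raw score, strictly increasing when a > 0."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Hypothetical raw confidences and correctness labels (1 = the answer was correct).
raw = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
correct = [1, 0, 1, 1, 0, 0]

# Hypothetical Platt parameters; a > 0 keeps the ordering of scores intact.
calibrated = [platt(s, a=4.0, b=-3.5) for s in raw]

auroc_raw = auroc(raw, correct)
auroc_cal = auroc(calibrated, correct)
mean_raw = sum(raw) / len(raw)
mean_cal = sum(calibrated) / len(calibrated)
```

Here the mean confidence drops after scaling, but the ranking of examples is untouched, so the two AUROC values coincide. Improving AUROC requires an extra signal that can reorder examples, which is the role the abstract assigns to the hallucination-detection inputs.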

94. 【2604.02532】Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?

Link: https://arxiv.org/abs/2604.02532

Authors: Kamalasankari Subramaniakuppusamy, Jugal Gajjar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Post-hoc feature attribution, safety-critical vision systems, remains poorly characterized, Post-hoc feature, realistic input perturbations

Comments: Accepted in the proceedings track of XAI4CV Workshop at CVPR 2026. It has 2 images, 5 tables, 6 equations, and 35 references in the main paper and 12 figures, 15 tables, and 3 references in the supplementary material

Click to view abstract

Abstract:Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
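As an illustration of one of the three metrics named above, the following is a minimal sketch of top-k Jaccard overlap between two attribution maps. It is not the FASS implementation: the flattened attribution values are hypothetical, and ranking features by absolute attribution is an assumption made here.

```python
def top_k_indices(attribution, k):
    """Indices of the k features with the largest absolute attribution (assumed ranking rule)."""
    order = sorted(range(len(attribution)), key=lambda i: abs(attribution[i]), reverse=True)
    return set(order[:k])

def top_k_jaccard(attr_a, attr_b, k):
    """Jaccard overlap |A & B| / |A | B| of the two top-k feature sets."""
    a, b = top_k_indices(attr_a, k), top_k_indices(attr_b, k)
    return len(a & b) / len(a | b)

# Hypothetical flattened attribution maps for an image and a perturbed copy
# whose predicted class is unchanged (the prediction-invariance condition).
original = [0.9, 0.1, 0.7, 0.05, 0.3, 0.6]
perturbed = [0.8, 0.2, 0.1, 0.05, 0.65, 0.7]

score = top_k_jaccard(original, perturbed, k=3)
```

With these values the top-3 sets are {0, 2, 5} and {0, 4, 5}, so two of four distinct features are shared and the overlap is 0.5. A pair whose prediction changed under perturbation would be filtered out before this score is computed.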

95. 【2604.02509】Rapidly deploying on-device eye tracking by distilling visual foundation models

Link: https://arxiv.org/abs/2604.02509

Authors: Cheng Jiang, Jogendra Kundu, David Colmenares, Fengting Yang, Joseph Robinson, Yatong An, Ali Behrooz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual reality applications, Eye tracking, plays a critical, reality applications, unlabeled real data

Comments:

Click to view abstract

Abstract:Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.

96. 【2604.02502】An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

Link: https://arxiv.org/abs/2604.02502

Authors: Md. Sajeebul Islam Sk., Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multi-view Magnetic Resonance, Lumbar Spinal Stenosis, Magnetic Resonance Imaging, diagnosis heavily dependent, Magnetic Resonance

Comments:

Click to view abstract

Abstract:Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge, with diagnosis heavily dependent on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while also preserving spatial accuracy, primarily because global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal components. First, we propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. Second, a novel Adaptive PID-Tversky Loss function integrates control-theory principles to dynamically modify training penalties, specifically targeting difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework demonstrates considerable performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework shows explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that keeps essential human supervision while enhancing diagnostic capabilities.
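For readers unfamiliar with the Tversky index that the loss above builds on, here is a minimal sketch of the standard, non-adaptive form. The PID-based adaptation of the weights is the paper's contribution and is not reproduced here; the masks and the fixed weights `alpha` and `beta` are hypothetical.

```python
def tversky_index(pred, target, alpha, beta, eps=1e-7):
    """Standard Tversky index TP / (TP + alpha*FP + beta*FN) on flattened masks."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, target, alpha, beta):
    """Loss is one minus the index, so a perfect overlap gives a loss near zero."""
    return 1.0 - tversky_index(pred, target, alpha, beta)

# Hypothetical under-segmented prediction: 2 of 4 foreground pixels are found.
target = [1, 1, 1, 1, 0, 0]
pred = [1, 1, 0, 0, 0, 0]

dice_like = tversky_index(pred, target, alpha=0.5, beta=0.5)  # equals the Dice score
fn_heavy = tversky_index(pred, target, alpha=0.3, beta=0.7)   # penalizes misses more
```

Setting `beta` above `alpha` lowers the index for the same under-segmented prediction, which is the lever a class-imbalance-aware loss can pull; how the paper schedules these penalties during training is not specified in the abstract.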

97. 【2604.02497】Delaunay Canopy: Building Wireframe Reconstruction from Airborne LiDAR Point Clouds via Delaunay Graph

Link: https://arxiv.org/abs/2604.02497

Authors: Donghyun Kim, Chanyoung Kim, Youngjoong Kwon, Seong Jae Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables structural understanding, Reconstructing building wireframe, airborne LiDAR point, LiDAR point clouds, Reconstructing building

Comments:

Click to view abstract

Abstract:Reconstructing building wireframes from airborne LiDAR point clouds yields a compact, topology-centric representation that enables structural understanding beyond dense meshes. Yet a key limitation persists: conventional methods have failed to achieve accurate wireframe reconstruction in regions afflicted by significant noise, sparsity, or internal corners. This failure stems from the inability to establish an adaptive search space to effectively leverage the rich 3D geometry of large, sparse building point clouds. In this work, we address this challenge with Delaunay Canopy, which utilizes the Delaunay graph as a geometric prior to define a geometrically adaptive search space. Central to our approach is Delaunay Graph Scoring, which not only reconstructs the underlying geometric manifold but also yields region-wise curvature signatures to robustly guide the reconstruction. Built on this foundation, our corner and wire selection modules leverage the Delaunay-induced prior to focus on highly probable elements, thereby shaping the search space and enabling accurate prediction even in previously intractable regions. Extensive experiments on the Building3D Tallinn city and entry-level datasets demonstrate state-of-the-art wireframe reconstruction, delivering accurate predictions across diverse and complex building geometries.

98. 【2604.02492】Token-Efficient Multimodal Reasoning via Image Prompt Packaging

Link: https://arxiv.org/abs/2604.02492

Authors: Joong Ho Choi, Jiayang Zhao, Avani Appalla, Himansh Mukesh, Dhwanil Vasani, Boyi Qian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Deploying large multimodal, Deploying large, Image Prompt Packaging, remains poorly characterized, strategies remains poorly

Comments: 9 pages including references

Click to view abstract

Abstract:Deploying large multimodal language models at scale is constrained by token-based inference costs, yet the cost-performance behavior of visual prompting strategies remains poorly characterized. We introduce Image Prompt Packaging (IPPg), a prompting paradigm that embeds structured text directly into images to reduce text token overhead, and benchmark it across five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation). We derive a cost formulation decomposing savings by token type and show IPPg achieves inference cost reductions of 35.8% to 91.0%. Despite token compression of up to 96%, accuracy remains competitive in many settings, though outcomes are highly model- and task-dependent: GPT-4.1 achieves simultaneous accuracy and cost gains on CoSQL, while Claude 3.5 incurs cost increases on several VQA benchmarks. Systematic error analysis yields a failure-mode taxonomy: spatial reasoning, non-English inputs, and character-sensitive operations are most vulnerable, while schema-structured tasks benefit most. A 125-configuration rendering ablation reveals accuracy shifts of 10 to 30 percentage points, establishing visual encoding choices as a first-class variable in multimodal system design.

99. 【2604.02486】VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Link: https://arxiv.org/abs/2604.02486

Authors: Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Vision Language Models, Vision Language, Language Models, achieve impressive performance, achieve impressive

Comments:

Click to view abstract

Abstract:Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline which focuses on moving visual information to the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens compared to unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.

100. 【2604.02479】Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI

Link: https://arxiv.org/abs/2604.02479

Authors: Valeria Martin, K. Brent Venable, Derek Morgan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: labeled satellite imagery, satellite imagery remains, based wildfire monitoring, wildfire monitoring systems

Comments: 22 pages, 7 figures

Click to view abstract

Abstract:The scarcity of labeled satellite imagery remains a fundamental bottleneck for deep-learning (DL)-based wildfire monitoring systems. This paper investigates whether a diffusion-based foundation model for Earth Observation (EO), EarthSynth, can synthesize realistic post-wildfire Sentinel-2 RGB imagery conditioned on existing burn masks, without task-specific retraining. Using burn masks derived from the CalFireSeg-50 dataset (Martin et al., 2025), we design and evaluate six controlled experimental configurations that systematically vary: (i) pipeline architecture (mask-only full generation vs. inpainting with pre-fire context), (ii) prompt engineering strategy (three hand-crafted prompts and a VLM-generated prompt via Qwen2-VL), and (iii) a region-wise color-matching post-processing step. Quantitative assessment on 10 stratified test samples uses four complementary metrics: Burn IoU, burn-region color distance ($\Delta C_{\text{burn}}$), Darkness Contrast, and Spectral Plausibility. Results show that inpainting-based pipelines consistently outperform full-tile generation across all metrics, with the structured inpainting prompt achieving the best spatial alignment (Burn IoU = 0.456) and burn saliency (Darkness Contrast = 20.44), while color matching produces the lowest color distance ($\Delta C_{\text{burn}} = 63.22$) at the cost of reduced burn saliency. VLM-assisted inpainting is competitive with hand-crafted prompts. These findings provide a foundation for incorporating generative data augmentation into wildfire detection pipelines. Code and experiments are available at: this https URL
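The Burn IoU figure reported above is an intersection-over-union score between burn regions. A minimal sketch of mask IoU follows; the 3x3 masks are hypothetical, and how the paper derives a burn mask from a generated image is not stated in the abstract, so the `generated_burn` mask is simply assumed to be given.

```python
def mask_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as equally sized 2D lists."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    union = sum(1 for p, t in zip(pred, true) if p or t)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0

# Hypothetical conditioning mask and the burn region detected in a generated tile.
reference_burn = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
generated_burn = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]

iou = mask_iou(generated_burn, reference_burn)
```

In this toy case three pixels overlap out of five covered by either mask, giving an IoU of 0.6.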

101. 【2604.02477】Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs

Link: https://arxiv.org/abs/2604.02477

Authors: Onur Selim Kilic, Yeti Z. Gurbuz, Cem O. Yaldiz, Afra Nawar, Etrit Haxholli, Ogul Can, Eli Waxman

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: executable clinical decision, Clinical practice guidelines, breaks cross-page continuity, clinical decision graph, clinical decision

Comments:

Click to view abstract

Abstract:Clinical practice guidelines are long, multimodal documents whose branching recommendations are difficult to convert into executable clinical decision support (CDS), and one-shot parsing often breaks cross-page continuity. Recent LLM/VLM extractors are mostly local or text-centric, under-specifying section interfaces and failing to consolidate cross-page control flow across full documents into one coherent decision graph. We present a decomposition-first pipeline that converts full-guideline evidence into an executable clinical decision graph through topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation. Rather than relying on single-pass generation, the pipeline uses explicit entry/terminal interfaces and semantic deduplication to preserve cross-page continuity while keeping the induced control flow auditable and structurally consistent. We evaluate on an adjudicated prostate-guideline benchmark with matched inputs and the same underlying VLM backbone across compared methods. On the complete merged graph, our approach improves edge and triplet precision/recall from $19.6\%/16.1\%$ in existing models to $69.0\%/87.5\%$, while node recall rises from $78.1\%$ to $93.8\%$. These results support decomposition-first, auditable guideline-to-CDS conversion on this benchmark, while current evidence remains limited to one adjudicated prostate guideline and motivates broader multi-guideline validation.
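The edge precision and recall figures above compare a predicted decision graph with an adjudicated reference graph. A minimal sketch of that comparison is shown below, assuming edges are matched exactly as (source, target) pairs; the paper's actual matching procedure may be more lenient, and the node names are hypothetical placeholders.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over exactly matching items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical reference graph: directed edges between decision nodes.
gold_edges = {
    ("start", "test_a"),
    ("test_a", "branch_low"),
    ("test_a", "branch_high"),
    ("branch_high", "followup"),
}

# Hypothetical extracted graph: three correct edges, two spurious, one missed.
predicted_edges = {
    ("start", "test_a"),
    ("test_a", "branch_high"),
    ("branch_high", "followup"),
    ("test_a", "followup"),
    ("start", "followup"),
}

edge_precision, edge_recall = precision_recall(predicted_edges, gold_edges)
```

The same function applies to node sets or to (source, relation, target) triplets, since only set membership is used.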

102. 【2604.02468】Hierarchical, Interpretable, Label-Free Concept Bottleneck Model

Link: https://arxiv.org/abs/2604.02468

Authors: Haodong Xie, Yujun Cai, Rahul Singh Maharjan, Yiwei Wang, Federico Tavella, Angelo Cangelosi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: black-box deep learning, deep learning models, Concept Bottleneck Models, Concept Bottleneck Model, Concept Bottleneck

Comments:

Click to view abstract

Abstract:Concept Bottleneck Models (CBMs) introduce interpretability to black-box deep learning models by predicting labels through human-understandable concepts. However, unlike humans, who identify objects at different levels of abstraction using both general and specific features, existing CBMs operate at a single semantic level in both concept and label space. We propose HIL-CBM, a Hierarchical Interpretable Label-Free Concept Bottleneck Model that extends CBMs into a hierarchical framework to enhance interpretability by more closely mirroring the human cognitive process. HIL-CBM enables classification and explanation across multiple semantic levels without requiring relational concept annotations. HIL-CBM aligns the abstraction level of concept-based explanations with that of model predictions, progressing from abstract to concrete. This is achieved by (i) introducing a gradient-based visual consistency loss that encourages abstraction layers to focus on similar spatial regions, and (ii) training dual classification heads, each operating on feature concepts at different abstraction levels. Experiments on benchmark datasets demonstrate that HIL-CBM outperforms state-of-the-art sparse CBMs in classification accuracy. Human evaluations further show that HIL-CBM provides more interpretable and accurate explanations, while maintaining a hierarchical and label-free approach to feature concepts.

103. 【2604.02467】VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

Link: https://arxiv.org/abs/2604.02467

Authors: Mengtian Li, Yuwei Lu, Feifei Li, Chenqi Gan, Zhifeng Xie, Xi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Cinematic camera control, tight feedback loop, camera control relies, Cinematic camera, reviewed and refined

Comments: 28 pages, 10 figures, ECCV 2026

Click to view abstract

Abstract:Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.
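The post-training step above uses Direct Preference Optimization. As a reminder of what that objective computes for a single preference pair, here is a minimal sketch of the standard DPO loss. It is not VERTIGO's training code: the log-probabilities are hypothetical numbers, and `beta` is an assumed temperature value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * (chosen gain - rejected gain)))."""
    chosen_gain = logp_chosen - ref_logp_chosen        # policy vs. frozen reference, preferred sample
    rejected_gain = logp_rejected - ref_logp_rejected  # policy vs. frozen reference, dispreferred sample
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))               # equals -log(sigmoid(margin))

# Hypothetical sequence log-probabilities of two camera trajectories for one prompt,
# where the scorer preferred the first trajectory.
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy equals reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)    # policy shifted toward the preferred sample
```

When the policy matches the reference the loss is log 2, and it decreases as the policy raises the likelihood of the preferred trajectory relative to the rejected one.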

104. 【2604.02457】Street-Legal Physical-World Adversarial Rim for License Plates

Link: https://arxiv.org/abs/2604.02457

Authors: Nikhil Kalidasu, Sahana Ganapathy

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Automatic license plate, Automatic license, ALPR, track vehicles, widely deployed

Comments: 20 pages, 8 figures, 5 tables, submitted to Security in Machine Learning Applications 2026

Click to view abstract

Abstract:Automatic license plate reader (ALPR) systems are widely deployed to identify and track vehicles. While prior work has demonstrated vulnerabilities in ALPR systems, far less attention has been paid to their legality and physical-world practicality. We investigate whether low-resourced threat actors can engineer a successful adversarial attack against a modern open-source ALPR system. We introduce the Street-legal Physical Adversarial Rim (SPAR), a physically realizable white-box attack against the popular ALPR system fast-alpr. SPAR requires no access to ALPR infrastructure during attack deployment and does not alter or obscure the attacker's license plate. Based on prior legislation and case law, we argue that SPAR is street-legal in the state of Texas. Under optimal conditions, SPAR reduces ALPR accuracy by 60% and achieves an 18% targeted impersonation rate. SPAR can be produced for under $100, and it was implemented entirely by commercial agentic coding assistants. These results highlight practical vulnerabilities in modern ALPR systems under realistic physical-world conditions and suggest new directions for both attack and defense.

105. 【2604.02447】PlayGen-MoG: Framework for Diverse Multi-Agent Play Generation via Mixture-of-Gaussians Trajectory Prediction

Link: https://arxiv.org/abs/2604.02447

Authors: Kevin Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Conditional Variational Autoencoders, team sports requires, Multi-agent trajectory generation, sports requires models, Multi-agent trajectory

Comments: 9 pages, 4 figures, 2 tables. Accepted to CVPRW 2026

Click to view abstract

Abstract:Multi-agent trajectory generation in team sports requires models that capture both the diversity of possible plays and realistic spatial coordination between players. Standard generative approaches such as Conditional Variational Autoencoders (CVAE) and diffusion models struggle with this task, exhibiting posterior collapse or convergence to the dataset mean. Moreover, most trajectory prediction methods operate in a forecasting regime that requires multiple frames of observed history, limiting their use for play design where only the initial formation is available. We present PlayGen-MoG, an extensible framework for formation-conditioned play generation that addresses these challenges through three design choices: (1) a Mixture-of-Gaussians (MoG) output head with shared mixture weights across all agents, where a single set of weights selects a play scenario that couples all players' trajectories, (2) relative spatial attention that encodes pairwise player positions and distances as learned attention biases, and (3) non-autoregressive prediction of absolute displacements from the initial formation, which eliminates cumulative error drift and removes the dependence on observed trajectory history. Together, these choices enable realistic play generation from a single static formation alone. On American football tracking data, PlayGen-MoG achieves 1.68 yard ADE and 3.98 yard FDE while maintaining full utilization of all 8 mixture components, with an entropy of 2.06 out of a maximum of 2.08; qualitative results confirm diverse generation without mode collapse.
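The entropy figures in the abstract are consistent with Shannon entropy in nats over 8 mixture components, whose maximum is ln 8 ≈ 2.08. The sketch below illustrates that reading; the weight vectors are hypothetical, and whether the paper averages weights over a dataset before computing entropy is not stated in the abstract.

```python
import math

def entropy(weights):
    """Shannon entropy (in nats) of a weight vector after normalizing it to sum to one."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

num_components = 8
max_entropy = math.log(num_components)  # upper bound, reached by uniform weights

# Hypothetical mixture-weight profiles.
uniform = entropy([1.0] * num_components)   # every component used equally
collapsed = entropy([0.93] + [0.01] * 7)    # one component dominates (mode collapse)
```

A value near the upper bound, such as the reported 2.06, indicates that the weights are close to uniform, while a collapsed mixture scores far lower.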

106. 【2604.02446】From Elevation Maps To Contour Lines: SVM and Decision Trees to Detect Violin Width Reduction

链接https://arxiv.org/abs/2604.02446

作者:Philémon Beghin,Anne-Emmanuelle Ceulemans,François Glineur

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:violin width reduction, photogrammetric meshes, explore the automatic, automatic detection, detection of violin

备注: Paper accepted for the Florence Heri-Tech 2026 Conference

Abstract:We explore the automatic detection of violin width reduction using 3D photogrammetric meshes. We compare SVM and Decision Trees applied to a geometry-based raw representation built from elevation maps with a more targeted, feature-engineered approach relying on parametric contour-line fitting. Although elevation maps occasionally achieve strong results, their performance does not surpass that of the contour-based inputs.

107. 【2604.02409】LumiVideo: An Intelligent Agentic System for Video Color Grading

链接https://arxiv.org/abs/2604.02409

作者:Yuchen Guo,Junli Gong,Hongmin Cai,Yiu-ming Cheung,Weifeng Su

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:critical post-production process, resonant cinematic visuals, emotionally resonant cinematic, transforms flat, critical post-production

备注

Abstract:Video color grading is a critical post-production process that transforms flat, log-encoded raw footage into emotionally resonant cinematic visuals. Existing automated methods act as static, black-box executors that directly output edited pixels, lacking both interpretability and the iterative control required by professionals. We introduce LumiVideo, an agentic system that mimics the cognitive workflow of professional colorists through four stages: Perception, Reasoning, Execution, and Reflection. Given only raw log video, LumiVideo autonomously produces a cinematic base grade by analyzing the scene's physical lighting and semantic content. Its Reasoning engine synergizes an LLM's internalized cinematic knowledge with a Retrieval-Augmented Generation (RAG) framework via a Tree of Thoughts (ToT) search to navigate the non-linear color parameter space. Rather than generating pixels, the system compiles the deduced parameters into industry-standard ASC-CDL configurations and a globally consistent 3D LUT, analytically guaranteeing temporal consistency. An optional Reflection loop then allows creators to refine the result via natural language feedback. We further introduce LumiGrade, the first log-encoded video benchmark for evaluating automated grading. Experiments show that LumiVideo approaches human expert quality in fully automatic mode while enabling precise iterative control when directed.
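
For readers unfamiliar with ASC-CDL: it describes a grade with per-channel slope, offset, and power values plus a single saturation value. A minimal sketch of that transform follows, assuming normalized RGB in [0, 1] and Rec. 709 luma weights for the saturation step. The function name and sample values are hypothetical, and this is not LumiVideo's implementation.

```python
def apply_cdl(rgb, slope, offset, power, saturation=1.0):
    """Apply an ASC-CDL style grade to one normalized RGB pixel.

    Per channel: out = clamp(in * slope + offset) ** power,
    followed by a saturation adjustment around Rec. 709 luma.
    """
    graded = []
    for value, s, o, p in zip(rgb, slope, offset, power):
        v = value * s + o
        v = min(max(v, 0.0), 1.0)  # clamp before the power function
        graded.append(v ** p)
    luma = 0.2126 * graded[0] + 0.7152 * graded[1] + 0.0722 * graded[2]
    return tuple(
        min(max(luma + saturation * (c - luma), 0.0), 1.0) for c in graded
    )

# Hypothetical grade parameters (illustrative only).
grade = dict(slope=(1.1, 1.0, 0.9), offset=(0.02, 0.0, -0.02),
             power=(1.0, 1.0, 1.0), saturation=1.2)

# The same pixel value appearing in two different frames.
frame_a_pixel = (0.2, 0.5, 0.8)
frame_b_pixel = (0.2, 0.5, 0.8)
out_a = apply_cdl(frame_a_pixel, **grade)
out_b = apply_cdl(frame_b_pixel, **grade)
```

Because the grade is a fixed function of the pixel value, identical inputs in different frames always map to identical outputs, which is the sense in which a parameter-based grade is temporally consistent.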

108. 【2604.02397】Variational Encoder--Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition

链接https://arxiv.org/abs/2604.02397

作者:Anderson Augusma(UGA, LIG, M-PSI),Dominique Vaufreydaz(LIG, M-PSI),Fédérique Letué(SVH)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Group Emotion Recognition, Emotion Recognition, aims to infer, public events, social environments

备注

Abstract:Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal fusion with audio). 
These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets using multimodal fusion with audio modality, VE-MD outperforms SOTA on SamSemo (77.9%, adding text modality) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%) and EngageNet (69.0%).

109. 【2604.02396】Environment-Aware Channel Prediction for Vehicular Communications: A Multimodal Visual Feature Fusion Framework

链接https://arxiv.org/abs/2604.02396

作者:Xuejian Zhang,Ruisi He,Minseok Kim,Inocent Calist,Mi Yang,Ziyi Qi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:key enabling technology, renders environment-aware channel, environment-aware channel prediction, channel prediction, enabling technology

备注: 13 pages, 14 figures

Abstract:The deep integration of communication with intelligence and sensing, as a defining vision of 6G, renders environment-aware channel prediction a key enabling technology. As a representative 6G application, vehicular communications require accurate and forward-looking channel prediction under stringent reliability, latency, and adaptability demands. Traditional empirical and deterministic models remain limited in balancing accuracy, generalization, and deployability, while the growing availability of onboard and roadside sensing devices offers a promising source of environmental priors. This paper proposes an environment-aware channel prediction framework based on multimodal visual feature fusion. Using GPS data and vehicle-side panoramic RGB images, together with semantic segmentation and depth estimation, the framework extracts semantic, depth, and position features through a three-branch architecture and performs adaptive multimodal fusion via a squeeze-excitation attention gating module. For 360-dimensional angular power spectrum (APS) prediction, a dedicated regression head and a composite multi-constraint loss are further designed. As a result, joint prediction of path loss (PL), delay spread (DS), azimuth spread of arrival (ASA), azimuth spread of departure (ASD), and APS is achieved. Experiments on a synchronized urban V2I measurement dataset yield the best root mean square error (RMSE) of 3.26 dB for PL, RMSEs of 37.66 ns, 5.05 degrees, and 5.08 degrees for DS, ASA, and ASD, respectively, and mean/median APS cosine similarities of 0.9342/0.9571, demonstrating strong accuracy, generalization, and practical potential for intelligent channel prediction in 6G vehicular communications.

110. 【2604.02392】Beyond Fixed Inference: Quantitative Flow Matching for Adaptive Image Denoising

链接https://arxiv.org/abs/2604.02392

作者:Jigang Duan,Genwei Ma,Xu Jiang,Wenfeng Xu,Ping Yang,Xing Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:flow-based generative models, Diffusion and flow-based, shown strong potential, flow-based generative, generative models

备注

Abstract:Diffusion and flow-based generative models have shown strong potential for image restoration. However, image denoising under unknown and varying noise conditions remains challenging, because the learned vector fields may become inconsistent across different noise levels, leading to degraded restoration quality under mismatch between training and inference. To address this issue, we propose a quantitative flow matching framework for adaptive image denoising. The method first estimates the input noise level from local pixel statistics, and then uses this quantitative estimate to adapt the inference trajectory, including the starting point, the number of integration steps, and the step-size schedule. In this way, the denoising process is better aligned with the actual corruption level of each input, reducing unnecessary computation for lightly corrupted images while providing sufficient refinement for heavily degraded ones. By coupling quantitative noise estimation with noise-adaptive flow inference, the proposed method improves both restoration accuracy and inference efficiency. Extensive experiments on natural, medical, and microscopy images demonstrate its robustness and strong generalization across diverse noise levels and imaging conditions.

111. 【2604.02371】Internalized Reasoning for Long-Context Visual Document Understanding

链接https://arxiv.org/abs/2604.02371

作者:Austin Veselka

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Visual long-document understanding, performing open recipes, Visual long-document, critical for enterprise, scientific applications

备注: 9 pages

Abstract:Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within \texttt{think} tags, gated by a \texttt{cot} control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7$\times$ larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4$\times$ fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.

112. 【2604.02355】From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

链接https://arxiv.org/abs/2604.02355

作者:Han Song,Yucheng Zhou,Jianbing Shen,Yu Cheng

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Reinforcement Learning, optimization remains unclear, remains unclear, Group Relative Policy, Relative Policy Optimization

备注

Abstract:Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
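
The entropy-based split described in the abstract can be sketched roughly as follows: compute each token's predictive entropy, give low-entropy tokens no reward-driven update, and add an entropy bonus for high-entropy tokens. This is only an illustration under stated assumptions. The threshold, the bonus coefficient, the additive form of the bonus, and all names are hypothetical and are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_guided_signals(token_probs, advantage, threshold, bonus_coef):
    """Return one training signal per token.

    Tokens with entropy below `threshold` get 0.0 (excluded from the
    reward-driven update); the rest get the advantage plus an entropy bonus.
    """
    signals = []
    for probs in token_probs:
        h = token_entropy(probs)
        if h < threshold:
            signals.append(0.0)
        else:
            signals.append(advantage + bonus_coef * h)
    return signals

# Two hypothetical tokens over a 4-symbol vocabulary:
# a confident (low-entropy) one and a uniform (high-entropy) one.
token_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
signals = entropy_guided_signals(token_probs, advantage=1.0,
                                 threshold=0.5, bonus_coef=0.1)
```

In this toy case the confident token is masked out, while the uniform token keeps its advantage and receives a bonus proportional to its entropy.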

113. 【2604.02338】LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning

链接https://arxiv.org/abs/2604.02338

作者:Md Kowsher,Haris Mansoor,Nusrat Jahan Prottasha,Ozlem Garibay,Victor Zhu,Zhengping Ji,Chen Chen

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:methods combine Mixture, combine Mixture, require separate adapters, adapter-based architectures, parameter-efficient fine-tuning

备注

Abstract:MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert, causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations, eliminating the learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.

114. 【2604.03224】HyperCT: Low-Rank Hypernet for Unified Chest CT Analysis

链接https://arxiv.org/abs/2604.03224

作者:Fengbei Liu,Sunwoo Kwak,Hao Phung,Nusrat Binta Nizam,Ilan Richter,Nir Uriel,Hadar Averbuch-Elor,Daborah Estrin,Mert R. Sabuncu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Non-contrast chest CTs, opportunistic extra-pulmonary screening, chest CTs offer, Non-contrast chest, extra-pulmonary screening

备注: MIDL 2026

Abstract:Non-contrast chest CTs offer a rich opportunity for both conventional pulmonary and opportunistic extra-pulmonary screening. While Multi-Task Learning (MTL) can unify these diverse tasks, standard hard-parameter sharing approaches are often suboptimal for modeling distinct pathologies. We propose HyperCT, a framework that dynamically adapts a Vision Transformer backbone via a Hypernetwork. To ensure computational efficiency, we integrate Low-Rank Adaptation (LoRA), allowing the model to regress task-specific low-rank weight updates rather than full parameters. Validated on a large-scale dataset of radiological and cardiological tasks, HyperCT outperforms various strong baselines, offering a unified, parameter-efficient solution for holistic patient assessment. Our code is available at this https URL.

115. 【2604.03112】ARIQA-3DS: A Stereoscopic Image Quality Assessment Dataset for Realistic Augmented Reality

链接https://arxiv.org/abs/2604.03112

作者:Aymen Sekhri,Seyed Ali Amirshahi,Mohamed-Chaker Larabi

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:immersive consumer adoption, Augmented Reality, technologies advance, consumer adoption, advance towards immersive

备注

Abstract:As Augmented Reality (AR) technologies advance towards immersive consumer adoption, the need for rigorous Quality of Experience (QoE) assessment becomes critical. However, existing datasets often lack ecological validity, relying on monocular viewing or simplified backgrounds that fail to capture the complex perceptual interplay, termed visual confusion, between real and virtual layers. To address this gap, we present ARIQA-3DS, the first large stereoscopic AR Image Quality Assessment dataset. Comprising 1,200 AR viewports, the dataset fuses high-resolution stereoscopic omnidirectional captures of real-world scenes with diverse augmented foregrounds under controlled transparency and degradation conditions. We conducted a comprehensive subjective study with 36 participants using a video see-through head-mounted display, collecting both quality ratings and simulator-sickness indicators. Our analysis reveals that perceived quality is primarily driven by foreground degradations and modulated by transparency levels, while oculomotor and disorientation symptoms show a progressive but manageable increase during viewing. ARIQA-3DS will be publicly released to serve as a comprehensive benchmark for developing next-generation AR quality assessment models.

116. 【2604.02868】Few-Shot Distribution-Aligned Flow Matching for Data Synthesis in Medical Image Segmentation

链接https://arxiv.org/abs/2604.02868

作者:Jie Yang,Ziqi Ye,Aihua Ke,Jian Luo,Bo Cai,Xiaosong Wang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:heterogeneity hinders clinical, hinders clinical deployment, Data heterogeneity hinders, generative data augmentation, medical image analysis

备注

Abstract:Data heterogeneity hinders clinical deployment of medical image analysis models, and generative data augmentation helps mitigate this issue. However, recent diffusion-based methods that synthesize image-mask pairs often ignore distribution shifts between generated and real images across scenarios, and such mismatches can markedly degrade downstream performance. To address this issue, we propose AlignFlow, a flow matching model that aligns with the target reference image distribution via differentiable reward fine-tuning, and remains effective even when only a small number of reference images are provided. Specifically, we divide the training of the flow matching model into two stages: in the first stage, the model fits the training data to generate plausible images; in the second stage, we introduce a distribution alignment mechanism and employ a differentiable reward to steer the generated images toward the distribution of the given samples from the target domain. In addition, to enhance the diversity of generated masks, we also design a flow-matching-based mask generation to complement the diversity in regions of interest. Extensive experiments demonstrate the effectiveness of our approach, i.e., performance improvement by 3.5-4.0% in mDice and 3.5-5.6% in mIoU across a variety of datasets and scenarios.

117. 【2604.02742】Task-Guided Prompting for Unified Remote Sensing Image Restoration

链接https://arxiv.org/abs/2604.02742

作者:Wenli Huang,Yang Wu,Xiaomeng Xin,Zhihong Liu,Jinjun Wang,Ye Deng

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Remote sensing image, enabling accurate downstream, accurate downstream analysis, recovering high-fidelity imagery, Remote sensing

备注: 17 pages, 11 figures

Abstract:Remote sensing image restoration (RSIR) is essential for recovering high-fidelity imagery from degraded observations, enabling accurate downstream analysis. However, most existing methods focus on single degradation types within homogeneous data, restricting their practicality in real-world scenarios where multiple degradations often occur across diverse spectral bands or sensor modalities, creating a significant operational bottleneck. To address this fundamental gap, we propose TGPNet, a unified framework capable of handling denoising, cloud removal, shadow removal, deblurring, and SAR despeckling within a single architecture. The core of our framework is a novel Task-Guided Prompting (TGP) strategy. TGP leverages learnable, task-specific embeddings to generate degradation-aware cues, which then hierarchically modulate features throughout the decoder. This task-adaptive mechanism allows the network to precisely tailor its restoration process for distinct degradation patterns while maintaining a single set of shared weights. To validate our framework, we construct a unified RSIR benchmark covering RGB, multispectral, SAR, and thermal infrared modalities for the five aforementioned restoration tasks. Experimental results demonstrate that TGPNet achieves state-of-the-art performance on both unified multi-task scenarios and unseen composite degradations, surpassing even specialized models in individual domains such as cloud removal. By successfully unifying heterogeneous degradation removal within a single adaptive framework, this work presents a significant advancement for multi-task RSIR, offering a practical and scalable solution for operational pipelines. The code and benchmark will be released at this https URL.

118. 【2604.02624】Wavelength-multiplexed massively parallel diffractive optical information storage and image projection

链接https://arxiv.org/abs/2604.02624

作者:Che-Yung Shen,Yuhang Li,Cagatay Isil,Jingxi Li,Leon Lenk,Tianyi Gan,Guangdong Ma,Fazil Onuralp Ardic,Mona Jarrahi,Aydogan Ozcan

类目:Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph)

关键词:store and project, composed of dielectric, dielectric surfaces, scale using deep, deep learning

备注: 28 Pages, 8 Figures

Abstract:We introduce a wavelength-multiplexed massively parallel diffractive information storage platform composed of dielectric surfaces that are structurally optimized at the wavelength scale using deep learning to store and project thousands of distinct image patterns, each assigned to a unique wavelength. Through numerical simulations in the visible spectrum, we demonstrated that our wavelength-multiplexed diffractive system can store and project over 4,000 independent desired images/patterns within its output field-of-view, with high image quality and minimal crosstalk between spectral channels. Furthermore, in a proof-of-concept experiment, we demonstrated a two-layer diffractive design that stored six distinct patterns and projected them onto the same output field of view at six different wavelengths (500, 548, 596, 644, 692, and 740 nm). This diffractive architecture is scalable and can operate at various parts of the electromagnetic spectrum without the need for material dispersion engineering or redesigning its optimized diffractive layers. The demonstrated storage capacity, reconstruction image fidelity, and wavelength-encoded massively parallel read-out of our diffractive platform offer a compact and fast-access solution for large-scale optical information storage and image projection applications.

119. 【2604.02564】Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It

链接https://arxiv.org/abs/2604.02564

作者:Sebo Diaz,Polina Golland,Elfar Adalsteinsson,Neel Dey

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:simple and theoretically-grounded, theoretically-grounded approach, domain generalization, Existing domain generalization, domain generalization methods

备注: Project GitHub [this https URL](https://github.com/sebodiaz/DropGen)

Abstract:We present DropGen, a simple and theoretically-grounded approach for domain generalization in 3D biomedical image segmentation. Modern segmentation models degrade sharply under shifts in modality, disease severity, clinical sites, and other factors, creating brittle models that limit reliable deployment. Existing domain generalization methods rely on extreme augmentations, mixing domain statistics, or architectural redesigns, yet incur significant implementation overhead and yield inconsistent performance across biomedical settings. DropGen instead proposes a principled learning strategy with minimal overhead that leverages both source-domain image intensities and domain-stable foundation model representations to train robust segmentation models. As a result, DropGen achieves strong gains in both fully supervised and few-shot segmentation across a broad range of shifts in biomedical studies. Unlike prior approaches, DropGen is architecture- and loss-agnostic, compatible with standard augmentation pipelines, computationally lightweight, and tackles arbitrary anatomical regions. Our implementation is freely available at this https URL.

120. 【2604.02448】Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview

链接https://arxiv.org/abs/2604.02448

作者:Shramana Dey,Zahir Khan,T. A. PramodKumar,B. Uma Shankar,Ashis K. Dhara,Ramachandran Rajalakshmi,Rajiv Raman,Sushmita Mitra

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Diabetic Retinopathy, vision loss worldwide, complication of diabetes, loss worldwide, microvascular complication

备注

Abstract:Diabetic Retinopathy (DR) is a serious microvascular complication of diabetes, and one of the leading causes of vision loss worldwide. Although automated detection and grading with Deep Learning (DL) can reduce the burden on ophthalmologists, it is constrained by the limited availability of high-quality datasets. Existing repositories often remain geographically narrow, contain limited samples, and exhibit inconsistent annotations or variable image quality, thereby restricting their clinical reliability. This paper presents a comprehensive review and comparative analysis of fundus image datasets used in the management of DR. The study evaluates their usability across key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening. It also categorizes the datasets by size, accessibility, and annotation type (such as image-level, lesion-level, and multi-disease). Finally, a recently published dataset is presented as a case study to illustrate broader challenges in dataset curation and usage. The review consolidates current knowledge while highlighting persistent gaps such as the lack of standardized lesion-level annotations and longitudinal data. It also outlines recommendations for future dataset development to support clinically reliable and explainable solutions in DR screening.