本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表，以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新748篇论文，其中：

自然语言处理106篇
信息检索8篇
计算机视觉186篇

自然语言处理

1. 【2605.21481】AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

作者：Junshu Pan,Panzhong Lu,Yixuan Weng,Qiyao Sun,Fang Guo,Zijie Yang,Qiji Zhou,Yue Zhang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：placing increasing strain, traditional academic publishing, academic publishing systems, journal-centered paradigms amid, paradigms amid rising

备注：

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at this https URL.

2. 【2605.21468】You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

链接：https://arxiv.org/abs/2605.21468

作者：Zhepei Wei,Xinyu Zhu,Wei-Lin Chen,Chengsong Huang,Jiaxin Huang,Yu Meng

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：trajectories remains underexplored, large language models, parameter trajectories remains, resulting parameter trajectories, RLVR weight trajectories

备注： preprint. Code: [this https URL](https://github.com/weizhepei/RELEX)

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at this https URL.

3. 【2605.21467】DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

链接：https://arxiv.org/abs/2605.21467

作者：Kaiyi Zhang,Wei Wu,Yankai Lin

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：large language models, Reinforcement learning, language models, central technique, technique for improving

备注：

点击查看摘要

Abstract:Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.

4. 【2605.21465】Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

链接：https://arxiv.org/abs/2605.21465

作者：Weixing Zhang,Bowen Jiang,Rahul Sharma,Regina Hebig,Daniel Strüber

类目：Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词：typically requires tedious, tedious manual work, Large Language Model-based, requires tedious manual, Large Language Models

备注：

点击查看摘要

Abstract:In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.

5. 【2605.21463】Mem-$π$: Adaptive Memory through Learning When and What to Generate

链接：https://arxiv.org/abs/2605.21463

作者：Xiaoqiang Wang,Chao Wang,Hadi Nekoei,Christopher Pal,Alexandre Lacoste,Spandana Gella,Bang Liu,Perouz Taslakian

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：external memory stores, framework for adaptive, generated on demand, retrieved from external, present Mem

备注： Work in progress

点击查看摘要

Abstract:We present Mem-$\pi$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$\pi$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$\pi$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

6. 【2605.21403】Quantifying the cross-linguistic effects of syncretism on agreement attraction

链接：https://arxiv.org/abs/2605.21403

作者：Utku Turk,Eva Neu

类目：Computation and Language (cs.CL)

关键词：verb erroneously agrees, Agreement attraction errors, grammatical head, principled account, verb erroneously

备注： SCiL Conference Paper

点击查看摘要

Abstract:Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.

7. 【2605.21391】Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

链接：https://arxiv.org/abs/2605.21391

作者：Lawhori Chakrabarti,Jennifer Johnson-Leung,Bert Baumgaertner,Aleksandar Vakanski,Min Xian,Boyu Zhang

类目：Computation and Language (cs.CL)

关键词：contextual meaning diverges, Metaphor requires, basic literal sense, contextual meaning, meaning diverges

备注： 18 pages, 3 figures, submitted to ICPR workshop

点击查看摘要

Abstract:Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

8. 【2605.21384】SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

链接：https://arxiv.org/abs/2605.21384

作者：Bingchen Zhao,Dhruv Srikanth,Yuxiang Wu,Zhengyao Jiang

类目：oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：oversight collapses, single surface, Reward hacking, automated test suite, long-horizon coding agents

备注：

点击查看摘要

Abstract:As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward hacking phenomenon by decompose software engineering tasks into three parts: (i) a natural language description of the specification (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short horizon tasks like building a JSON parser to ultra long horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on holdout suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.

9. 【2605.21369】Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

链接：https://arxiv.org/abs/2605.21369

作者：Michal Novák,Miloslav Konopík,Anna Nedoluzhko,Martin Popel,Ondřej Pražák,Jakub Sido,Milan Straka,Zdeněk Žabokrtský,Daniel Zeman

类目：Computation and Language (cs.CL)

关键词：Multilingual Coreference Resolution, Coreference Resolution, Shared Task, held in conjunction, paper describes

备注： Accepted to CODI-CRAC 2026

点击查看摘要

Abstract:This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.

Comments:
Accepted to CODI-CRAC 2026

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2605.21369 [cs.CL]

(or
arXiv:2605.21369v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.21369

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Milan Straka [view email] [v1]
Wed, 20 May 2026 16:35:09 UTC (227 KB)

10. 【2605.21363】"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

链接：https://arxiv.org/abs/2605.21363

作者：Eunsu Kim,Jessica R. Mindel,Kyungjin Kim,Sherry Tongshuang Wu

类目：Computation and Language (cs.CL)

关键词：evaluators assessing AI-assisted, large language models, increasingly shape, large language, evaluators assessing

备注：

点击查看摘要

Abstract:As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

11. 【2605.21362】LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

链接：https://arxiv.org/abs/2605.21362

作者：Abdullah Al Nomaan Nafi,Fnu Suya,Swarup Bhunia,Prabuddha Chakraborty

类目：Computation and Language (cs.CL)

关键词：intended safety behavior, aligned large language, large language models, safety behavior, adversarial prompting

备注：

点击查看摘要

Abstract:Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

12. 【2605.21338】xt Analytics Evaluation Framework: A Case Study on LLMs and Social Media

链接：https://arxiv.org/abs/2605.21338

作者：Yuefeng Shi,Nedjma Ousidhoum,Jose Camacho-Collados

类目：Computation and Language (cs.CL)

关键词：demonstrated exceptional proficiency, demonstrated exceptional, exceptional proficiency, wide range, NLP tasks

备注：

点击查看摘要

Abstract:LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

13. 【2605.21333】SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

链接：https://arxiv.org/abs/2605.21333

作者：Ting Liu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：combine Transformer-like language, stable multi-domain pre-training, Natively trained spiking, Transformer-like language quality, spiking language models

备注： 35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript

点击查看摘要

Abstract:Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at 89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.

Comments:
35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2605.21333 [cs.CL]

(or
arXiv:2605.21333v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.21333

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

14. 【2605.21318】xtReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

链接：https://arxiv.org/abs/2605.21318

作者：Lucheng Fu,Ye Yu,Yiyang Wang,Yiqiao Jin,Haibo Jin,B. Aditya Prakash,Haohan Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large language models, Large language, language models, behavioral constraints, highly sensitive

备注： Code: [this https URL](https://github.com/luchengfu6/TextReg)

点击查看摘要

Abstract:Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

15. 【2605.21299】racing the ongoing emergence of human-like reasoning in Large Language Models

链接：https://arxiv.org/abs/2605.21299

作者：Paolo Morosi,Nikoleta Pantelidou,Fritz Günther,Elena Pagliarini,Evelina Leivada

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：mow the lawn, literal meanings, fifty dollars, hearers hunger, give you fifty

备注：

点击查看摘要

Abstract:Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

16. 【2605.21256】Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification

链接：https://arxiv.org/abs/2605.21256

作者：Rodrigo Morales-Sánchez,Soto Montalvo,Raquel Martínez

类目：Computation and Language (cs.CL)

关键词：Natural Language Processing, clinical Natural Language, Language Processing, Natural Language, Human Immunodeficiency Virus

备注： Accepted at the BioNLP Workshop @ ACL 2026

点击查看摘要

Abstract:Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain.

17. 【2605.21235】LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

链接：https://arxiv.org/abs/2605.21235

作者：Zhe Yuan,Yipeng Zhou,Jinghan Li,Xinyuan Chen,Bowen Deng,Zhiqian Chen,Liang Zhao

类目：Computation and Language (cs.CL)

关键词：scientific question answering, Reinforcement learning, improving reasoning language, reasoning language models, question answering

备注：

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbf{LamPO}, a \textbf{Lambda-Style Policy Optimization} method that replaces scalar group advantages with a \emph{Pairwise Decomposed Advantage}. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.

18. 【2605.21227】Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

链接：https://arxiv.org/abs/2605.21227

作者：Nina Hosseini-Kivanani

类目：Computation and Language (cs.CL)

关键词：respect community norms, writing assistance, respect community, community norms, borrowing

备注： Accepted to Neollm colocated with LREC2026, Three figures and three tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3{,}050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 -- 35\% up to 71 -- 81\% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.

19. 【2605.21182】Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

链接：https://arxiv.org/abs/2605.21182

作者：Jeonghun Baek,Atsuyuki Miyai,Shota Onohara,Hikaru Ikuta,Kiyoharu Aizawa

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Japanese popular culture, forms of Japanese, Japanese popular, culturally distinctive multimodal, distinctive multimodal medium

备注： Accepted to the Culture x AI Workshop at ICML 2026. Project page: [this https URL](https://manga109.github.io/manga109-project-website/en/)

点击查看摘要

Abstract:Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

20. 【2605.21178】Metaphors in Literary Post-Editing: Opening Pandora's Box?

链接：https://arxiv.org/abs/2605.21178

作者：Aletta G. Dorst,Mayra O. Nas,Katinka Zeven

类目：Computation and Language (cs.CL)

关键词：Large Language Models, Neu ral Machine, ral Machine Translation, Language Models, translated by Neu

备注： This paper has been accepted for presentation at the EAMT Conference 2026, which will take place in Tilburg from June 15 to 18, 2026

点击查看摘要

Abstract:This paper investigates how post-editors of literary texts react and respond to the way metaphors have been translated by Neu ral Machine Translation (NMT) and Large Language Models (LLMs). The results show that one in three metaphors in the output were changed by the post-editors, demonstrating that the translation of fig urative language is indeed problematic in literary MT (LitMT). The responses indi cate that the post-editors were aware of overly literal translations, though mostly for multiword expressions. Moreover, at times they found it difficult to determine whether solutions were acceptable. They rated the overall quality of the MT out put as quite poor and stated that the post editing was more work and more effort than it would have been translating from scratch. This supports previous studies ar guing that post-editing constrains transla tors in their creativity and diminishes their sense of text ownership.

21. 【2605.21177】ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

链接：https://arxiv.org/abs/2605.21177

作者：Yongkang Liu,Zijing Wang,Mengjie Zhao,Ercong Nie,Mingyang Wang,Qian Li,Feiliang Ren,Shi Feng,Daling Wang,Hinrich Schütze

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：activated working set, dynamically activated working, textsc, ChunkFT, work presents

备注：

点击查看摘要

Abstract:This work presents \textsc{ChunkFT}, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. \textsc{ChunkFT} enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textsc{ChunkFT} in the deterministic setting. Empirically, we apply \textsc{ChunkFT} to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2$\times$ H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textsc{ChunkFT} in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that \textsc{ChunkFT} consistently outperforms existing memory-efficient baselines. Notably, \textsc{ChunkFT} achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on this https URL.

22. 【2605.21154】Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

链接：https://arxiv.org/abs/2605.21154

作者：Fernando Ortega,Raúl Lara-Cabrera,Jorge Dueñas-Lerín,Alejandro de la Torre-Luque,Mercé Salvador Robert,Enrique Baca-García

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：massive administrative burden, Natural Language Processing, Mental health, global priority, Large Language Models

备注：

点击查看摘要

Abstract:Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5\_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5\_large model, through end-to-end fine-tuning, achieved the highest performance with a $F1_{micro}$ score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of ``long-tail'' label distributions and the inherent ambiguity of psychiatric discourse.

23. 【2605.21147】SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

链接：https://arxiv.org/abs/2605.21147

作者：Yongkang Liu,Xing Li,Mengjie Zhao,Shanru Zhang,Zijing Wang,Qian Li,Shi Feng,Feiliang Ren,Daling Wang,Hinrich Schütze

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：tailoring pre-trained large, pre-trained large language, large language models, go-to choice, choice for tailoring

备注：

点击查看摘要

Abstract:As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity. Theory suggests that LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix. As the rank increases, more principal singular directions are preserved, which generally improves the model's performance. However, a larger rank also introduces more trainable parameters, leading to higher computational cost. To overcome this dilemma, we propose SMoA, a \textbf{S}pectrum \textbf{Mo}dulation \textbf{A}dapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget. SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions. We provide theoretical analysis and empirical results on multiple tasks. In our experiments, SMoA improves average performance in the current lower-budget setting over LoRA and competitive LoRA-style baselines.

24. 【2605.21135】Smarter edits? Post-editing with error highlights and translation suggestions

链接：https://arxiv.org/abs/2605.21135

作者：Fleur V.J. van Tellingen,Gautam Ranka,Dora Žugčić,Joyce van der Wal,Andrea Camasta,Livio Guerra,Alina Karakanta

类目：Computation and Language (cs.CL)

关键词：usefulness remains limited, enhanced post-editing features, interest in enhanced, remains limited, error highlights

备注： Accepted at EAMT 2026

点击查看摘要

Abstract:As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

25. 【2605.21102】ACL-Verbatim: hallucination-free question answering for research

链接：https://arxiv.org/abs/2605.21102

作者：Gábor Recski,Szilveszter Tóth,Nadia Verdha,István Boros,Ádám Kovács

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

关键词：Large Language Models, Large Language, collecting high-quality information, produce factually inaccurate, tendency of Large

备注： 13 pages

点击查看摘要

Abstract:Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

26. 【2605.21097】WCXB: A Multi-Type Web Content Extraction Benchmark

链接：https://arxiv.org/abs/2605.21097

作者：Murrough Foley

类目：Computation and Language (cs.CL)

关键词：NLP dataset construction, Web content extraction, language model training, page main content, large language model

备注： Dataset: [this http URL](http://github.com/Murrough-Foley/web-content-extraction-benchmark) , [this http URL](http://doi.org/10.5281/zenodo.19316874) . Leaderboard: [this http URL](http://webcontentextraction.org) . Preprint also deposited at [this http URL](http://doi.org/10.5281/zenodo.19664685)

点击查看摘要

Abstract:Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

27. 【2605.21086】LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

链接：https://arxiv.org/abs/2605.21086

作者：Seogyeong Jeong,Kiwoong Park,Seyoung Song,Eunsu Kim,Ken E. Friedl,Jaeho Kim,Alice Oh

类目：Computation and Language (cs.CL)

关键词：Large Language Models, Large Language, real-world deployment requirements, domain-specific evaluation standards, evaluation standards tailored

备注： To appear in ACL 2026 Industry Track

点击查看摘要

Abstract:While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

28. 【2605.21076】GradeLegal: Automated Grading for German Legal Cases

链接：https://arxiv.org/abs/2605.21076

作者：Abdullah Al Zubaer,Lorenz Wendlinger,Simon Alexander Nonn,Michael Granitzer,Jelena Mitrovic

类目：Computation and Language (cs.CL)

关键词：faces growing volumes, solutions faces growing, German legal exam, qualified graders, creating a bottleneck

备注：

点击查看摘要

Abstract:Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

29. 【2605.21071】Fine-grained Claim-level RAG Benchmark for Law

链接：https://arxiv.org/abs/2605.21071

作者：Souvick Das,Sallam Abualhaija,Domenico Bianculli

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：large language models, shifting semantic search, LLMs generate responses, legal RAG systems, RAG systems

备注：

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

30. 【2605.21063】APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

链接：https://arxiv.org/abs/2605.21063

作者：Philipp Spohn,Leander Girrbach,Zeynep Akata

类目：Computation and Language (cs.CL)

关键词：Typical LLM responses, Typical LLM, LLM responses tend, tend to follow, follow a default

备注：

点击查看摘要

Abstract:Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.

31. 【2605.21049】Cross-lingual robustness of LLM-brain alignment and its computational roots

链接：https://arxiv.org/abs/2605.21049

作者：Ni Yang,Rui He,Philipp Homan,Iris Sommer,Davide Staub,Wolfram Hinzen

类目：Computation and Language (cs.CL)

关键词：Large language models, reliably predict neural, Large language, reliably predict, comprehension and transformer

备注：

点击查看摘要

Abstract:Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, overlaps spatially across languages, and what the computational roots of such alignment are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment across three typologically distinct languages: Mandarin, English, and French during naturalistic story listening. Our results show that across languages, transformer-based models predicted activity in a distributed landscape spanning widely distributed cortical functional networks like limbic, ventral attention, default mode network, and subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but not explainable from predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.

32. 【2605.21029】Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

链接：https://arxiv.org/abs/2605.21029

作者：Stephen Meisenbacher,Peter Norlander

类目：Computation and Language (cs.CL)

关键词：potentially complex domains, Utilizing LLMs, automated taxonomy construction, taxonomy construction presents, complex domains

备注： 14 pages, 2 figures, 8 tables. Accepted to CustomNLP4U 2026

点击查看摘要

Abstract:Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

33. 【2605.21027】Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs

链接：https://arxiv.org/abs/2605.21027

作者：Gundeep Singh,Parsa Kavehzadeh,Jing Xia,Xue-Yong Fu,Julien Bouvier Tremblay,Md Tahmid Rahman Laskar,Vincent Lum,Shashi Bhushan TN

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：make organizational data, organizational data accessible, traditional business intelligence, business intelligence tools, Large Language Models

备注： The first four authors contributed equally to this work

点击查看摘要

Abstract:Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.

34. 【2605.21006】Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

链接：https://arxiv.org/abs/2605.21006

作者：Ishaan Kelkar,Nebras Alam,Vikram Kakaria,Madhur Panwar,Vasu Sharma,Maheep Chaudhary

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Contrastive Activation Addition, Contrastive Activation, Activation Addition, CAA, sycophancy

备注：

点击查看摘要

Abstract:We study the effect of different persona on \textbf{sycophancy}: model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately $68\%$ and $98\%$ of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: this https URL.

35. 【2605.20998】Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

链接：https://arxiv.org/abs/2605.20998

作者：Yan Xia,Zhuangzhuang Pan,Amirrudin Kamsin,Chee Seng Chan

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Aspect-Term Sentiment Analysis, Sentiment Analysis, Aspect-Term Sentiment, efficiency and expressiveness, faces a fundamental

备注： Accepted at ACL2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

点击查看摘要

Abstract:Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M = 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at this https URL

36. 【2605.20994】owards Context-Invariant Safety Alignment for Large Language Models

链接：https://arxiv.org/abs/2605.20994

作者：Yixu Wang,Yang Yao,Xin Wang,Yifeng Gao,Yan Teng,Xingjun Ma,Yingchun Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Preference-based post-training aligns, post-training aligns LLMs, Preference-based post-training, remains brittle, post-training aligns

备注： ICML 2026

点击查看摘要

Abstract:Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

37. 【2605.20967】ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization

链接：https://arxiv.org/abs/2605.20967

作者：Wajdi Zaghouani,Kais Attia,Md. Rafiul Biswas,Fadhl Eryani

类目：Computation and Language (cs.CL)

关键词：Arab world, Arabic political memes, Arabic political discourse, Arabic political, cultural positions

备注： Accepted at LREC 2026 Main Conference

点击查看摘要

Abstract:Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offers a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.

38. 【2605.20960】JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

链接：https://arxiv.org/abs/2605.20960

作者：Wajdi Zaghouani,Shimaa Amer Ibrahim,Mabrouka Bessghaier,Houda Bouamor

类目：Computation and Language (cs.CL)

关键词：job announcements collected, Arabic job announcements, paper introduces JobArabi, paper introduces, job announcements

备注： Accepted at LREC 2026 Main Conference

点击查看摘要

Abstract:This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.

39. 【2605.20948】Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

链接：https://arxiv.org/abs/2605.20948

作者：Runxi Cheng,Yuchen Guan,Yongxian Wei,Qianpu Sun,Qixiu Li,Sinan Du,Feng Xiong,Chun Yuan,Yan Lu,Yeyun Gong

类目：Computation and Language (cs.CL)

关键词：Engram learn large, large memory tables, learn large memory, making memory scaling, memory scaling expensive

备注： 25 pages, 12 figures, 5 tables

点击查看摘要

Abstract:Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.

40. 【2605.20946】hinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

链接：https://arxiv.org/abs/2605.20946

作者：Xuan Du,Qiangyu Yan,Wenshuo Li,Borui Jiang,Changming Xiao,Han Shu,Xinghao Chen

类目：Computation and Language (cs.CL)

关键词：paradigm aims, communication more human, aims to make, make AI communication, Linguistic Quality Reward

备注：

点击查看摘要

Abstract:The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

41. 【2605.20936】DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

链接：https://arxiv.org/abs/2605.20936

作者：Weizhe Chen,Miao Zhang,Junpeng Jiang,Yaping Li,Weili Guan,Liqiang Nie

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：improving LLM inference, LLM inference efficiency, increasingly important paradigm, improving LLM, LLM inference

备注： 19 pages, 7 figures

点击查看摘要

Abstract:Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

42. 【2605.20924】Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

链接：https://arxiv.org/abs/2605.20924

作者：Po-Chun Chen,Hen-Hsen Huang,Hsin-Hsi Chen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Designing effective task-level, Designing effective, Large Language Models, Large Language, Language Models

备注： Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

43. 【2605.20920】Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

链接：https://arxiv.org/abs/2605.20920

作者：Vinicius Ribeiro,Yves Laprie

类目：Computation and Language (cs.CL); Sound (cs.SD)

关键词：Recent advances, phonetic sequences, advances in machine, machine learning, conditioned on phonetic

备注： Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

点击查看摘要

Abstract:Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps exploring additional dimensions on speech articulation synthesis.

Comments:
Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

Subjects:

Computation and Language (cs.CL); Sound (cs.SD)

Cite as:
arXiv:2605.20920 [cs.CL]

(or
arXiv:2605.20920v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.20920

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

44. 【2605.20916】ask-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

链接：https://arxiv.org/abs/2605.20916

作者：Yaping Chai,Haoran Xie,Joe S. Qin

类目：Computation and Language (cs.CL)

关键词：explicit opinion words, Implicit sentiment analysis, Implicit sentiment, opinion words, inferred from events

备注： 8 pages, 4 figures, and 3 tables

点击查看摘要

Abstract:Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at this https URL.

45. 【2605.20915】Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

链接：https://arxiv.org/abs/2605.20915

作者：Divyaksh Shukla,Ashutosh Modi

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：specific training data, uncertainty estimation essential, making reliable prediction, preserving reliable behavior, Local Mutual Information

备注： Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

点击查看摘要

Abstract:Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.

46. 【2605.20912】Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

链接：https://arxiv.org/abs/2605.20912

作者：Dimitris Roussis,Sokratis Sofianopoulos,Stelios Piperidis

类目：Computation and Language (cs.CL)

关键词：necessitates effective communication, research necessitates effective, scientific research necessitates, increasing volume, necessitates effective

备注：

点击查看摘要

Abstract:The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

47. 【2605.20876】rminal-World: Scaling Terminal-Agent Environments via Agent Skills

链接：https://arxiv.org/abs/2605.20876

作者：Zihao Cheng,Hongru Wang,Zeming Liu,Xinyi Wang,Xiangrong Zhu,Yuhang Guo,Wei Lin,Jeff Z. Pan,Yunhong Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：extend Large Language, Large Language Models, Terminal agents extend, agents extend Large, Large Language

备注： Work in Progress

点击查看摘要

Abstract:Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

48. 【2605.20833】MemGym: a Long-Horizon Memory Environment for LLM Agents

链接：https://arxiv.org/abs/2605.20833

作者：Wujiang Xu,Yu Wang,Kai Mei,Kaiqu Liang,Zhenting Wang,Mingyu Jin,Han Zhang,Shi-Xiong Zhang,Wenyue Hua,Sambit Sahu,Dimitris N. Metaxas

类目：Computation and Language (cs.CL)

关键词：LLM agents operating, capability for LLM, LLM agents, long-horizon tasks, central capability

备注：

点击查看摘要

Abstract:Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

49. 【2605.20815】GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

链接：https://arxiv.org/abs/2605.20815

作者：Peter Fernandes,Ria Kanjilal

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：Graph-based Retrieval Augmented, Retrieval Augmented Generation, extends retrieval-augmented generation, Augmented Generation, Electronic Health Record

备注： 9 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

50. 【2605.20813】PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

链接：https://arxiv.org/abs/2605.20813

作者：Yanyi Lyu,Letian Chen,Futing Sun,Miao Zhang,Weili Guan,Liqiang Nie

类目：Computation and Language (cs.CL)

关键词：computationally expensive, full self-attention, repeatedly executed, Inference, Inference in diffusion

备注：

点击查看摘要

Abstract:Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95$\times$ end-to-end speedup over FlashAttention across several context lengths.

51. 【2605.20809】Refining and Reusing Annotation Guidelines for LLM Annotation

链接：https://arxiv.org/abs/2605.20809

作者：Kon Woo Kim,Jin-Dong Kim,Akiko Aizawa

类目：Computation and Language (cs.CL)

关键词：Large Language Models, demonstrate remarkable performance, Large Language, zero-shot annotation tasks, Language Models

备注： 14 pages, 7 figures. Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.

52. 【2605.20798】Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

链接：https://arxiv.org/abs/2605.20798

作者：Yang Zhao,Jiahao Lu,Bin Huang,Guhua Zhang,Jie Zhou

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：downstream evaluation, Transformer modifications, Narang, modifications, downstream

备注： 19 pages, 3 figures, under review at EMNLP 2026

点击查看摘要

Abstract:Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.

53. 【2605.20793】Assessing socio-economic climate impacts from text data

链接：https://arxiv.org/abs/2605.20793

作者：Mariana Madruga de Brito,Brielen Madureira,Taís Maria Nunes Carvalho,Damien Delforge,Aglaé Jézéquel,Murathan Kurfalı,Ni Li,Gabriele Messori,Joakim Nivre,Barbara Pernici,Niko Speybroeck,Stefano Terzi,Wim Thiery,Bram Valkenborg,Jingxian Wang,Shorouq Zahra,Jakob Zscheischler,Jan Sodoge

类目：Computation and Language (cs.CL)

关键词：natural language processing, large language models, Recent advances, large-scale textual data, language processing

备注： Work in progress

点击查看摘要

Abstract:Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.

54. 【2605.20786】Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

链接：https://arxiv.org/abs/2605.20786

作者：Wajdi Zaghouani

类目：Computation and Language (cs.CL)

关键词：English or Chinese, historically underserved relative, reflects on twenty, twenty years, spoken by hundreds

备注： Accepted at the ACL 2026 Workshop : The Big Picture 2026: Crafting a Research Narrative v2

点击查看摘要

Abstract:This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.

55. 【2605.20767】he Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

链接：https://arxiv.org/abs/2605.20767

作者：Victoria Lin,Taedong Yun,Maja Matarić,John Canny,Arthur Gretton,Alexander D'Amour

类目：Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)

关键词：Large language models, Large language, language models, human behavior, offering a scalable

备注：

点击查看摘要

Abstract:Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.

56. 【2605.20761】Findings of the Counter Turing Test: AI-Generated Text Detection

链接：https://arxiv.org/abs/2605.20761

作者：Rajarshi Roy,Gurpreet Singh,Ashhar Aziz,Shashwat Bajpai,Nasrin Imanpour,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Amitava Das,Amit Sheth,Vasu Sharma,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha

类目：Computation and Language (cs.CL)

关键词：introduced significant challenges, rapid proliferation, introduced significant, maintaining the integrity, integrity of digital

备注： Defactify4 @AAAI 2025

点击查看摘要

Abstract:The rapid proliferation of AI-generated text has introduced significant challenges in maintaining the integrity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.

Comments:
Defactify4 @AAAI 2025

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2605.20761 [cs.CL]

(or
arXiv:2605.20761v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.20761

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

57. 【2605.20745】he Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

链接：https://arxiv.org/abs/2605.20745

作者：Yefan Zhou,Yilun Zhou,Austin Xu,Soroush Vosoughi,Shafiq Joty,Jiang Gui

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：reject correct reasoning, miss erroneous steps, Generative verifiers, poorly calibrated, correct reasoning

备注：

点击查看摘要

Abstract:Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at this https URL.

58. 【2605.20743】Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

链接：https://arxiv.org/abs/2605.20743

作者：Juncheng Hu,Jiawei Du,Xin Zhang,Joey Tianyi Zhou

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Vision-language models solve, solve geometry problems, drawing code carries, models solve geometry, intermediate states remain

备注：

点击查看摘要

Abstract:Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at this https URL

59. 【2605.20740】Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

链接：https://arxiv.org/abs/2605.20740

作者：Jungsoo Park,Hyungjoo Chae,Ethan Mendes,Jay DeYoung,Varsha Kishore,Wei Xu,Alan Ritter

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：floating-point number independently, improving point estimates, predict real-valued quantities, Large language models, decoded floating-point number

备注： 21 pages, 5 figures

点击查看摘要

Abstract:Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

60. 【2605.20730】Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

链接：https://arxiv.org/abs/2605.20730

作者：Jihoon Kwon,Jiwon Choi,Jy-yong Sohn

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：context length increases, In-context learning, escalating inference costs, large language models, task

备注： 9 pages, preprint

点击查看摘要

Abstract:In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.

61. 【2605.20729】MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

链接：https://arxiv.org/abs/2605.20729

作者：Junhao Ruan,Abudukeyumu Abudula,Bei Li,Yongjing Yin,Xinyu Liu,Kechen Jiao,Xin Chen,Jingang Wang,Xunliang Cai,Tong Xiao,Jingbo Zhu

类目：Computation and Language (cs.CL)

关键词：advancing Retrieval-Augmented Generation, Accurate evaluation, Retrieval-Augmented Generation, RAG, pivotal for advancing

备注： Accepted to ACL 2026 (main conference). 28 pages. Code and data: [this https URL](https://github.com/rangehow/mtr-suite)

点击查看摘要

Abstract:Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at this https URL.

62. 【2605.20712】SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

链接：https://arxiv.org/abs/2605.20712

作者：Kavya Manohar,Arghya Bhattacharya,Kush Juvekar,Kumarmanas Nethil

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Automatic speech recognition, speech recognition replaces, recognition replaces typing, Automatic speech, domain term costs

备注： Submitted to Interspeech 2026

点击查看摘要

Abstract:Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

63. 【2605.20693】Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

链接：https://arxiv.org/abs/2605.20693

作者：Tong Wang,Yiqing Xu,Leo Yang Yang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

关键词：auditors to apply, Interpretable text representations, Interpretable text, independent auditors, text

备注：

点击查看摘要

Abstract:Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $\kappa$, and selects features by residual held-out predictive gain. A stylized analysis connects the $\kappa$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.

64. 【2605.20689】DIVE: Embedding Compression via Self-Limiting Gradient Updates

链接：https://arxiv.org/abs/2605.20689

作者：Dongfang Zhao

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：vector search systems, large language models, language models impose, models impose significant, impose significant storage

备注：

点击查看摘要

Abstract:High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textsc{DIVE} (\textbf{D}imensionality reduction with \textbf{I}mplicit \textbf{V}iew \textbf{E}nsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textsc{DIVE} outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.

65. 【2605.20684】Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

链接：https://arxiv.org/abs/2605.20684

作者：Linus Ng Junjia,Ezekiel Tee Kongquan,Kelvin Heng,Kenneth Zhu Ke,Zhao Jing Yuan

类目：Computation and Language (cs.CL)

关键词：Corporate credit underwriting, extract actionable evidence, documents spanning hundreds, credit underwriting requires, underwriting requires analysts

备注：

点击查看摘要

Abstract:Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2605.20684 [cs.CL]

(or
arXiv:2605.20684v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.20684

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

66. 【2605.20668】On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

链接：https://arxiv.org/abs/2605.20668

作者：Seungone Kim,Dongkeun Yoon,Kiril Gashteovski,Juyoung Suk,Jinheon Baek,Pranjal Aggarwal,Ian Wu,Viktor Zaverkin,Spase Petkoski,Daniel R. Schrider,Ilija Dukovski,Francesco Santini,Biljana Mitreska,Yong Jeong,Kyeongha Kwon,Young Min Sim,Dragana Manasova,Arthur Porto,Biljana Mojsoska,Makoto Takamoto,Marko Shuntov,Ruoqi Liu,Hyunjoo Jenny Lee,Niyazi Ulas Dinç,Yehhyun Jo,Sunkyu Han,Chungwoo Lee,Huishan Li,Esther H. R. Tsai,Ergun Simsek,Khushboo Shafi,Yeonseung Chung,Jihye Park,Aleksandar Shulevski,Henrik Christiansen,Yoosang Son,Elly Knight,Amanda Montoya,Jeongyoun Ahn,Christian Langkammer,Heera Moon,Changwon Yoon,Nikola Stikov,Mooseok Jang,Edward Choi,Junhan Kim,Yeon Sik Jung,Woo Youn Kim,Jae Kyoung Kim,Ishraq Md Anjum,Hyun Uk Kim,Drew Bridges,Carolin Lawrence,Xiang Yue,Alice Oh,Akari Asai,Sean Welleck,Graham Neubig

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：scientists simply view, scientific peer review, evaluate research, deployed in scientific, scientific peer

备注： Work in progress

点击查看摘要

Abstract:With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

67. 【2605.20643】AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

链接：https://arxiv.org/abs/2605.20643

作者：Duy Nguyen,Hanqi Xiao,Archiki Prasad,Zaid Khan,Anirban Das,Austin Zhang,Sambit Sahu,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：enables language models, Self-distillation enables language, privileged information unavailable, enables language, learn on-policy

备注： Code: [this https URL](https://github.com/duykhuongnguyen/AVSD)

点击查看摘要

Abstract:Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

68. 【2605.20628】Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation

链接：https://arxiv.org/abs/2605.20628

作者：Sylvey Lin,Joe Menke,Shufan Ming,Dongin Nam,Neil Smalheiser,Halil Kilicoglu

类目：Computation and Language (cs.CL)

关键词：downstream NLP applications, biomedical knowledge discovery, NLP applications, downstream NLP, Biomedical abstracts play

备注： Accepted by BioNLP 2026

点击查看摘要

Abstract:Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at this https URL and this https URL.

69. 【2605.20626】Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

链接：https://arxiv.org/abs/2605.20626

作者：Aashish Dhawan,Christopher Driggers-Ellis,Dzmitry Kasinets,Daisy Zhe Wang,Christan Grant

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Florida Gators submission, University of Florida, Florida Gators, cultural image captioning, present the University

备注：

点击查看摘要

Abstract:We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain 150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

70. 【2605.20616】Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

链接：https://arxiv.org/abs/2605.20616

作者：Chongrui Ye,Yuxiang Liu,Yu Wang,Haofei Yu,Yining Zhao,Ge Liu,Julian McAuley,Jiaxuan You

类目：Computation and Language (cs.CL)

关键词：Language agents increasingly, agents increasingly operate, Language agents, convert accumulated experience, related tasks

备注： Preprint

点击查看摘要

Abstract:Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12$\times$ smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining -- using 6$\times$ less memory than the strongest baseline on ALFWorld.

71. 【2605.20613】HRM-Text: Efficient Pretraining Beyond Scaling

链接：https://arxiv.org/abs/2605.20613

作者：Guan Wang,Changling Liu,Chenyu Wang,Cai Zhou,Yuhao Sun,Yifei Wu,Shuai Zhen,Luca Scimeca,Yasin Abbasi Yadkori

类目：Computation and Language (cs.CL)

关键词：internet-scale raw text, current pretraining paradigm, raw text, creating a significant, paradigm for large

备注：

点击查看摘要

Abstract:The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

72. 【2605.20602】Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

链接：https://arxiv.org/abs/2605.20602

作者：Ming Liu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Successive self-training, diversity drops, distributions narrow, process of flattening, widely characterized

备注： 19 pages (14 main + 5 appendix), 8 figures, 3 tables

点击查看摘要

Abstract:Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

73. 【2605.20591】Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

链接：https://arxiv.org/abs/2605.20591

作者：Sunday Oyinlola Ogundoyin,Muhammad Ikram,Rahat Masood

类目：Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：provide clinical guidance, custom medical GPTs, including custom medical, Medical large language, large language models

备注：

点击查看摘要

Abstract:Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

74. 【2605.20588】Direct Translation between Sign Languages

链接：https://arxiv.org/abs/2605.20588

作者：Zetian Wu,Bowen Xie,Wuyang Meng,Milan Gautam,Stefan Lee,Liang Huang

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：witnessed significant progress, remains largely unexplored, languages remains largely, sign language, sign

备注：

点击查看摘要

Abstract:The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique in the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text-sign (T2S) and sign-sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% high BLEU-4) while achieving a roughly 2.3* speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

75. 【2605.20563】Multi-agent Collaboration with State Management

链接：https://arxiv.org/abs/2605.20563

作者：Mengyang Liu,Taozhi Chen,Zhenhua Xu,Xue Jiang,Yihong Dong

类目：Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词：solving complex tasks, shown great potential, Recent advances, complex tasks, shown great

备注：

点击查看摘要

Abstract:Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi-agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post-hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate-ORiented Management for multi-agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git-worktree-based multi-agent baseline by +18.7 on Commit0-Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single-agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi-agent collaboration than workspace isolation. STORM can also be plugged into any multi-agent system seamlessly.

76. 【2605.20558】When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

链接：https://arxiv.org/abs/2605.20558

作者：Wen Zhang

类目：Computation and Language (cs.CL)

关键词：achieve high aggregate, high aggregate accuracy, Neural morphological generation, conceal systematic errors, systematic errors concentrated

备注：

点击查看摘要

Abstract:Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.

77. 【2605.20537】What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

链接：https://arxiv.org/abs/2605.20537

作者：Robert Leaman,Rezarta Islamaj,Zhiyong Lu

类目：Computation and Language (cs.CL)

关键词：named entity recognition, strongly depend, depend on annotated, Biomedical named entity, entity recognition

备注： Accepted to the ACL 25th Workshop on Biomedical Language Processing

点击查看摘要

Abstract:Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

78. 【2605.20530】AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

链接：https://arxiv.org/abs/2605.20530

作者：Parsa Mazaheri,Kasra Mazaheri

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词：final task success, Large language model, Large language, language model agents, operating systems

备注：

点击查看摘要

Abstract:Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model's apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes. To demonstrate the methodology we run a small fixed eight-model set (1,342 generated items, four frontier closed and four open-weight) under both prompt modes. Removing the explicit label menu drops every model's trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor regardless of family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention. We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.

79. 【2605.20529】Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

链接：https://arxiv.org/abs/2605.20529

作者：Claire Hobbs,R. Thomas McCoy

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：linguistic input assist, signals in linguistic, English subject-verb agreement, statistical signals, subject-verb agreement

备注： Accepted to CoNLL

点击查看摘要

Abstract:In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

80. 【2605.20525】NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

链接：https://arxiv.org/abs/2605.20525

作者：Mohammad H. Abbasi,Favour Nerrise,Shaurnav Ghosh,Ridvan Yesiloglu,Yuncong Mao,Bailey Trang,Mohammad Asadi,Merryn Daniel,Gustavo Chau Loo Kung,Ken Chang,Pavan Pinkesh Shah,Adam Turnbull,Kyan Younes,Seena Dehkharghani,Ehsan Adeli(Stanford University)

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词：brain magnetic resonance, magnetic resonance imaging, visual question answering, visual question, question answering

备注： 30 pages, dataset and benchmark release

点击查看摘要

Abstract:We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from $$80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

81. 【2605.20506】Reinforcing Human Behavior Simulation via Verbal Feedback

链接：https://arxiv.org/abs/2605.20506

作者：Weiwei Sun,Xuhui Zhou,Jiarui Liu,Weihua Du,Haojia Sun,Yiqing Xie,Qianou Ma,Sihao Chen,Mengting Wan,Longqi Yang,Pei Zhou,Sherry Wu,Sean Welleck,Graham Neubig,Yiming Yang,Maarten Sap

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：learn social norms, friend explaining, Humans learn social, verbal feedback, simulate human behavior

备注：

点击查看摘要

Abstract:Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.

82. 【2605.20478】Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

链接：https://arxiv.org/abs/2605.20478

作者：Chen Shen

类目：Computation and Language (cs.CL)

关键词：retroactively attach page-level, attach page-level citations, unsupported rows, recall entries, entries from parametric

备注： 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild

点击查看摘要

Abstract:LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.

83. 【2605.20477】raining Language Agents to Learn from Experience

链接：https://arxiv.org/abs/2605.20477

作者：Yuval Shalev,Zifeng Ding,Mateja Jamnik

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：current reflection-based methods, single task instance, current reflection-based, reflection-based methods, Language agents

备注：

点击查看摘要

Abstract:Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor's performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.

84. 【2605.20435】Hiding in Plain Sight: Finding MAHA on Reddit

链接：https://arxiv.org/abs/2605.20435

作者：Sabit Ahmed,Subigya Nepal,Henry Kautz

类目：ocial and Information Networks (cs.SI); Computation and Language (cs.CL)

关键词：Make America Healthy, genetically modified food, Make America, broadly accepted concerns, America Healthy

备注： Submitted to ASONAM 2026

点击查看摘要

Abstract:Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

85. 【2605.20410】Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs

链接：https://arxiv.org/abs/2605.20410

作者：Edie Pearman,Sophia Osborne,Mira Kandlikar-Bloch,Mina Arzaghi,Florian Carichon,Golnoosh Farnadi

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large language models, socially sensitive settings, Large language, encode gender biases, increasingly deployed

备注： 24 pages, 6 figures, including appendix. Accepted at the ICLR 2026 Workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems. Submitted to COLM 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.

86. 【2605.20404】Puzzled By ChatGPT? No more! A Jigsaw Puzzle to Promote AI Literacy and Awareness

链接：https://arxiv.org/abs/2605.20404

作者：Francesca Padovani,Malvina Nissim

类目：Computation and Language (cs.CL)

关键词：including LLM-based chatbots, support public understanding, adoption of Generative, including LLM-based, chatbots like ChatGPT

备注：

点击查看摘要

Abstract:The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.

87. 【2605.20382】Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

链接：https://arxiv.org/abs/2605.20382

作者：Carolina Camassa,Derek Shiller

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：powerful pattern completers, pattern completers, powerful pattern, models, follow instructions

备注： 31 pages

点击查看摘要

Abstract:Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

88. 【2605.20369】DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

链接：https://arxiv.org/abs/2605.20369

作者：Zhaohui Zheng,Chenhang He,Shihao Wang,Yuxuan Li,Ming-Ming Cheng,Lei Zhang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：large language models, Number Token Loss, Number prediction stands, Discretized Distance Loss, language models

备注：

点击查看摘要

Abstract:Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at this https URL

89. 【2605.20364】When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

链接：https://arxiv.org/abs/2605.20364

作者：Jinlong Liu,Mohammed Bahja,Mark Lee

类目：Computation and Language (cs.CL)

关键词：fully capture creativity-related, capture creativity-related dimensions, Automatic evaluation, long-form literary writing, originality and flexibility

备注： Submit to EMNLP 2026

点击查看摘要

Abstract:Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

90. 【2605.20356】Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

链接：https://arxiv.org/abs/2605.20356

作者：Pablo Riera,Pablo Brusco,Cristina Kuo,Marcelo Sancinetti,S.R.K. Branavan

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

关键词：enabling interaction dynamics, interaction dynamics closer, Full-duplex spoken dialogue, speak simultaneously, turn-based systems

备注：

点击查看摘要

Abstract:Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained \textit{Moshi} model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.

91. 【2605.20315】Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

链接：https://arxiv.org/abs/2605.20315

作者：Haiquan Lu,Zigeng Chen,Gongfan Fang,Xinyin Ma,Xinchao Wang

类目：Computation and Language (cs.CL)

关键词：memory retrieval, solving complex tasks, multi-step interaction, recently emerged, powerful paradigm

备注：

点击查看摘要

Abstract:LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

92. 【2605.20268】Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

链接：https://arxiv.org/abs/2605.20268

作者：Paul Quinlan,Jeremy Levasseur,Qingguo Li,Xiaodan Zhu

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Real-world time series, Real-world time, time series, Real-world, time

备注：

点击查看摘要

Abstract:Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

93. 【2605.20247】CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

链接：https://arxiv.org/abs/2605.20247

作者：Yang Liu,Toan Nguyen,Flora D. Salim

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：Catastrophic forgetting remains, continual learning, Catastrophic forgetting, remains a major, major obstacle

备注：

点击查看摘要

Abstract:Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision--language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

94. 【2605.20244】Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

链接：https://arxiv.org/abs/2605.20244

作者：Jialin Lu,Soonho Kong,Rodrigo Stehling,Kaiyu Yang,Zhangyang Wang,Weiran Sun,Wuyang Chen

类目：Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词：present Lean Refactor, retrieval-augmented agentic framework, Lean Refactor, Lean, Lean Refactor steers

备注：

点击查看摘要

Abstract:We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean's release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over $70\%$ token-level compression on competition benchmarks, over $20\%$ on research repositories, and up to $60\%$ compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

95. 【2605.20241】Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

链接：https://arxiv.org/abs/2605.20241

作者：Woo Seob Sim,Yu Rang Park

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：large language models, strong average detection, unsafe prompts, large language, language models

备注：

点击查看摘要

Abstract:Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones ($1.2$B--$70$B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.

96. 【2605.20202】Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

链接：https://arxiv.org/abs/2605.20202

作者：Rana Muhammad Usman

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：locally deployed language, emotionally framed evaluation, framed evaluation follow-ups, evaluation follow-ups change, calm-relative internal representations

备注： 18 pages, 4 figures. Exploratory empirical study with fully local experiments on small open language models. Code and data: [this https URL](https://github.com/ranausmanai/LLMEmotionGeometry)

点击查看摘要

Abstract:I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

97. 【2605.20201】Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

链接：https://arxiv.org/abs/2605.20201

作者：Miao Li,Irina Saparina,Alexander Gurung,Mirella Lapata

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Recent large language, Recent large, million tokens, require complex reasoning, large language models

备注： Long, ACL 2026 (Main conference)

点击查看摘要

Abstract:Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

98. 【2605.20199】FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

链接：https://arxiv.org/abs/2605.20199

作者：Runzhe Zhang,Letian Chen,Wenpeng Zhang,Zhouhan Lin,Peilin Zhao

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：language model transformed, pre-trained diffusion language, efficient fine-tuning, flow matching language, diffusion language models

备注： 26 pages, 11 figures

点击查看摘要

Abstract:We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high quality few-step generation that rivals or even outperforms the quality of 2,000-step diffusion sampling with very few training epochs. Remarkably, finetuned FlowLM reaches performance saturation with only half as many training epochs as training from scratch, both approaches greatly outperforming the original diffusion model, thereby validating our method. Furthermore, we validate a more effective training objective for flow matching: predicting clean data to consistently guide the sampling process towards the true data distribution. Empirical results demonstrate that our approach is highly effective for high-quality, few-step text generation.

99. 【2605.20197】MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

链接：https://arxiv.org/abs/2605.20197

作者：Zhichao Yang,Gregory D. Lyng,Sanjit Singh Batra,Robert E. Tillman

类目：Computation and Language (cs.CL)

关键词：electronic health records, health records underpins, Medical concept extraction, concept extraction, Medical

备注：

点击查看摘要

Abstract:Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

100. 【2605.20196】Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

链接：https://arxiv.org/abs/2605.20196

作者：Zihui Song,Shihao Ji,Hongxi Li,Shuaizhi Cheng,Chunlin Huang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：real-data scaling laws, latent predictive contribution, predictive contribution spectrum, predictive contribution, investigate the hypothesis

备注： 8 pages,6 figures

点击查看摘要

Abstract:We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw spectrum and R^2 about 0.90 for the smoothed spectrum. These findings provide strong empirical support for a simple mechanism picture: training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.

101. 【2605.20195】Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues

链接：https://arxiv.org/abs/2605.20195

作者：Xinyue Kang,Maodong Li,Yibin Zheng,Fang Kong

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：actively providing suggestions, providing suggestions, dialogue, designed to steer, steer conversations

备注： ICASSP2026

点击查看摘要

Abstract:A target-oriented proactive dialogue system is designed to steer conversations toward predefined targets while actively providing suggestions. The core paradigm of such a system is to plan a reasonable dialogue path and subsequently guide language models (e.g., pre-trained or large language models) to generate responses, where dialogue path planning serves as the central component-a novel yet under-explored problem. In this work, we propose a Forward-Focused Bidirectional Pseudo-Siamese Network (FF-BPSN) for dialogue path planning toward predefined dialogue targets. FF-BPSN employs two identical transformer-based decoders for forward and backward planning, together with a forward-focused module that integrates bidirectional information to construct the final forward path. This path benefits from bidirectional planning while prioritizing forward information. We then employ the planned path to guide language models in response generation. Extensive experiments on DuRecDial and DuRecDial 2.0 demonstrate that FF-BPSN achieves state-of-the-art performance in dialogue path planning and significantly enhances the effectiveness of target-oriented proactive dialogue systems.

102. 【2605.20194】Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

链接：https://arxiv.org/abs/2605.20194

作者：Aisvarya Adeseye,Jouni Isoaho,Adeyemi Adeseye

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large language models, Large language, Large, long documents, analyzing long documents

备注： Accepted to be Published in 12th Intelligent Systems Conference 2026, 3-4 September 2026 in Amsterdam, The Netherlands

点击查看摘要

Abstract:Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.

103. 【2605.20193】Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

链接：https://arxiv.org/abs/2605.20193

作者：Aisvarya Adeseye,Jouni Isoaho,Adeyemi Adeseye

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Quantized Large Language, Quantized Large, fewer computing resources, Large Language Models, Large Language

备注： Accepted to publish in 12th Intelligent Systems Conference 2026; 3-4 September 2026 in Amsterdam, The Netherlands

点击查看摘要

Abstract:Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different lower bits quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.

104. 【2605.20192】Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

链接：https://arxiv.org/abs/2605.20192

作者：Xintong Wu,Peiting Tsai,Jing Yuan,Michael Yu,Greg Sun,Luyao Zhang

类目：Computation and Language (cs.CL)

关键词：expanding Metaverse ecosystem, native MANA token, reality platform operating, Metaverse ecosystem, expanding Metaverse

备注：

点击查看摘要

Abstract:Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.

105. 【2605.20191】Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs

链接：https://arxiv.org/abs/2605.20191

作者：Marco Bombieri,Simone Paolo Ponzetto,Marco Rospocher

类目：Computation and Language (cs.CL)

关键词：Modern Large Language, Large Language Models, Modern Large, Large Language, simulate human behavior

备注： Accepted for publication in ACM Transactions on Intelligent Systems and Technology

点击查看摘要

Abstract:Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.

106. 【2602.08028】Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

链接：https://arxiv.org/abs/2602.08028

作者：Po-Chun Chen,Hen-Hsen Huang,Hsin-Hsi Chen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：recent methods guide, large language models, methods guide large, guide large language, paths in standard

备注： Accepted to Findings of IJCNLP-AACL 2025

点击查看摘要

Abstract:To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

信息检索

1. 【2605.21057】SG-LegalCite: A Principle-Augmented Benchmark for Legal Citation Retrieval in Singapore Law

链接：https://arxiv.org/abs/2605.21057

作者：Shannon Lee Yueh Ern,Kaidong Feng,Yingpeng Du,Chloe Lee En Jia,Zhu Sun

类目：Information Retrieval (cs.IR)

关键词：common-law systems depends, legal citation retrieval, Legal citation, factual similarity, Legal

备注：

点击查看摘要

Abstract:Legal citation in common-law systems depends not only on factual similarity, but also on the legal principle for which a precedent is invoked. However, existing benchmarks for legal citation retrieval use case facts, citation context, or full judgments as inputs, where the governing legal principle is often missing or only implicitly expressed and entangled with broader context. As a result, models may retrieve precedents that are factually similar yet doctrinally irrelevant. This limitation is particularly consequential in Singapore, where the legal system has evolved independently: only domestic precedents are binding, while foreign authorities serve merely as persuasive references. Thus, we propose a new retrieval paradigm that ranks cited cases based on queries integrating case facts and explicit legal principles, inspired by real-world legal reasoning workflows. To support this paradigm, we introduce SG-LegalCite, a dataset of 100,890 case-principle pairs extracted from 8,523 Singapore Supreme Court judgments spanning from 2000 to 2025. Experiments across 11 baselines demonstrate the effectiveness of our principle-augmented retrieval paradigm, showing that explicit legal principles provide strong discriminative signals for legal citation retrieval.

2. 【2605.20926】MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts

链接：https://arxiv.org/abs/2605.20926

作者：Zhen Tao,Jinxiang Zhao,Peng Liu,Dinghao Xi,Yanfang Chen,Wei Xu,Zhiyu Li

类目：Information Retrieval (cs.IR)

关键词：large language models, enable conversational agents, conversational agents based, apply user-specific information, systems enable conversational

备注：

点击查看摘要

Abstract:Long-term memory systems enable conversational agents based on large language models (LLMs) to retain, retrieve, and apply user-specific information across multi-session interactions. However, existing evaluations mainly assess outcome-level performance or temporal updating, providing limited insight into how systems retrieve and rank temporally valid, factually correct, and contextually applicable memory evidence under conflicting alternatives. To address this gap, we propose MemConflict, a diagnostic framework that treats memory validity as a query-conditioned fitness-for-use problem. MemConflict formalizes dynamic, static, and conditional conflicts over temporal validity, factual correctness, and contextual applicability. It simulates controlled long-horizon histories from structured user profiles, introduces cross-session conflicts, and injects semantically similar distractors to create competition among memory candidates. The resulting multi-session dialogue benchmark supports black-box evaluation of final answers and white-box analysis of supporting-memory retrieval and ranking. Experiments on six representative long-term memory systems show uneven strengths across conflict types, with answer correctness often diverging from memory retrieval and ranking. Sensitivity analyses reveal that longer histories, distractors, implicit queries, and larger conflict distances degrade performance. Diagnostics show failures from missing supporting memories and ineffective use of retrieved memories. Collectively, MemConflict advances principled long-term memory governance through retrieval-aware, conflict-aware reliability assessment.

3. 【2605.20815】GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

链接：https://arxiv.org/abs/2605.20815

作者：Peter Fernandes,Ria Kanjilal

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：Graph-based Retrieval Augmented, Retrieval Augmented Generation, extends retrieval-augmented generation, Augmented Generation, Electronic Health Record

备注： 9 pages, 1 figure, 5 tables

点击查看摘要

4. 【2605.20724】CALMem : Application-Layer Dual Memory for Conversational AI

链接：https://arxiv.org/abs/2605.20724

作者：Rajendra Narayan Jena,Rajan Padmanabhan,Sankar Arumugam

类目：Information Retrieval (cs.IR)

关键词：Large language models, Large language, fundamentally limit conversational, operate within fixed, limit conversational continuity

备注：

点击查看摘要

Abstract:Large language models (LLMs) operate within fixed context windows that fundamentally limit conversational continuity. When context fills, compaction discards history irreversibly; when sessions end, all memory resets to zero. Existing solutions-larger context windows, retrieval-augmented generation for knowledge bases, and memory-augmented architectures such as MemGPT-either require model modification, impose provider lock-in, or do not address the compaction continuity problem. We present CALMem (Conversational Application-Layer Memory), an application-layer dual memory architecture that gives LLM-based conversational assistants virtually unbounded effective context without any modification to the underlying model. CALMem combines two complementary memory subsystems: an episodic memory layer built on sliding-window vector embeddings of conversation history, and a semantic memory layer of agent-writable structured facts. A token-budget-adaptive injection mechanism, called the MOIM (Message of Injected Memory), automatically retrieves and injects relevant past context each turn, scaling injection depth inversely with context pressure. A key contribution is intra-session retrieval: compacted away turns from the current session remain searchable, closing a gap unaddressed by prior work. The system is implemented as a pure application layer in a production Rust codebase, is provider-agnostic, and degrades to original LLM behaviour with zero overhead when disabled. We describe the architecture, design decisions, and performance characteristics, and analyse the trade-offs that guided each implementation choice.

5. 【2605.20689】DIVE: Embedding Compression via Self-Limiting Gradient Updates

链接：https://arxiv.org/abs/2605.20689

作者：Dongfang Zhao

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：vector search systems, large language models, language models impose, models impose significant, impose significant storage

备注：

点击查看摘要

6. 【2605.20683】Layer-wise Token Compression for Efficient Document Reranking

链接：https://arxiv.org/abs/2605.20683

作者：Shengyao Zhuang,zhichao Xu,Ivano Lauriola

类目：Information Retrieval (cs.IR)

关键词：information retrieval systems, modern information retrieval, Transformer-based document cross-encoder, Transformer-based document, retrieval systems

备注： SIGIR2026 short paper

点击查看摘要

Abstract:Transformer-based document cross-encoder rerankers are a central component of modern information retrieval systems. Despite their success, these models suffer from high computational costs due to processing long query-document sequences at inference time. A known approach to improve efficiency is token compression, which consists of aggregating groups of tokens together in the initial embedding layer, reducing the effective number of tokens, and making the computation faster. While token compression has proven to be successful for bi-encoder retrievers, we empirically observed that this approach may be ineffective for cross-encoder rerankers. In this paper, we propose Layer-wise Token Compression (LTC), which applies adaptive token pooling at intermediate transformer layers. Through extensive ablation studies on MS MARCO passage and document ranking tasks, we demonstrate that compression at middle layers preserves ranking quality while increasing inference QPS by up to 25% for passage ranking and up to 116% for document ranking. We also extend LTC to listwise LLM rerankers and show that the same approach can be easily applied to long-context listwise reranking, where the QPS improvements are even greater. More surprisingly, when applying rerankers trained on short passages to long-document ranking tasks, models trained with compression outperform their uncompressed counterparts, suggesting that compression may act as a beneficial regularizer that encourages length-invariant representations.

7. 【2605.20254】Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

链接：https://arxiv.org/abs/2605.20254

作者：Amritansh Maurya,Navjot Singh,Mohammed Javed,Omar Moured

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Large Language Models, Large Language, requires precise cell, shown promising results, precise cell retrieval

备注： Accepted for Presentation in ICDAR 2026, Vienna, Austria

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising results on NLP tasks, however, their performance on tabular data still needs research attention, because Table Question-Answering (TQA) requires precise cell retrieval and multi-step structured reasoning. Existing work improves TQA either by fine-tuning or training LLMs on task-specific tabular data, but often lacks verifiable control over how the model navigates tables and derives answers. In this work, we propose a training-free TQA approach with two structured prompting frameworks: TableGrid Navigation (TGN), which iteratively navigates rows and columns via a three-module loop to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which enforces columns identification for explicit progressive row selection constraint according to the query. We evaluate 17 LLMs against 6 baselines on TableBench and FeTaQa dataset. On TableBench, TGN improves over the strongest baseline by 3.8 points, and on FeTaQa, PIP achieves SOTA performance over ReAct and Chain-of-Thought. Beyond inference-time gains, PIP and TGN can also serve as supervision templates to fine-tune small models, narrowing the performance gap to much larger architectures in resource-constrained settings, offering versatile and cost-efficient solution for TQA.

8. 【2605.20220】Advanced Scientific Methodology Plays Rossini

链接：https://arxiv.org/abs/2605.20220

作者：Silvia Licciardi,Daniela Macchione,Emmanuel Caronna,Elisa Francomano

类目：ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：times implicit, composer intentions, musical score, essential instructions, Music Philology

备注：

点击查看摘要

Abstract:A musical score provides the essential instructions for its performance while containing indications - at times implicit - regarding the composer's intentions. The presence of authorial variants, and even more so complex series of revisions associated with a single text, presents a challenging path for analytical study. This research, situated within the application of Scientific Methodologies to Music Philology, proposes a methodological approach oriented toward the structural analysis of one of the many settings composed by Gioachino Rossini on the same Metastasio arietta ``Mi lagnerò tacendo''. Through Computational Analysis - incorporating parsing, data mining, and graph theory - the melodic, harmonic, and textual compositional choices have been rigorously explored. The results constitute a significant unicum in the field, laying the foundation for a systematic study that supports philological research and paves the way for the use of generative models to investigate the creative process.

计算机视觉

1. 【2605.21489】Variance Reduction for Expectations with Diffusion Teachers

链接：https://arxiv.org/abs/2605.21489

作者：Jesse Bettencourt,Xindi Wu,Matan Atzmon,James Lucas,Jonathan Lorraine

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)

关键词：Pretrained diffusion models, diffusion models serve, frozen teachers feeding, Pretrained diffusion, teachers feeding downstream

备注： Project page: [this https URL](https://research.nvidia.com/labs/sil/projects/CARV/)

点击查看摘要

Abstract:Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.

2. 【2605.21487】Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

链接：https://arxiv.org/abs/2605.21487

作者：Dian Zheng,Manyuan Zhang,Hongyu Li,Hongbo Liu,Kai Zou,Kaituo Feng,Hongsheng Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：enhancing Unified Multimodal, Unified Multimodal Models, Unified Multimodal, enhancing Unified, Multimodal Models

备注： Project Page: [this https URL](https://zhengdian1.github.io/Uni-Edit-proj/) Code: [this https URL](https://github.com/zhengdian1/Uni-Edit)

点击查看摘要

Abstract:Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

3. 【2605.21484】One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

链接：https://arxiv.org/abs/2605.21484

作者：Chaoyang Wang,Yunhai Tong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：diffusion models excel, Discrete diffusion models, iterative decoding, rely on slow, diffusion models

备注：

点击查看摘要

Abstract:Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

4. 【2605.21479】WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

链接：https://arxiv.org/abs/2605.21479

作者：Basel Shbita,Pengyuan Li,Anna Lisa Gentile

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：largely emphasized perception-based, emphasized perception-based tasks, Visual Question Answering, Question Answering, largely emphasized

备注：

点击查看摘要

Abstract:Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

5. 【2605.21478】Latent Dynamics for Full Body Avatar Animation

链接：https://arxiv.org/abs/2605.21478

作者：Shichong Peng,Chengxiang Yin,Fei Jiang,Zhongshi Jiang,Lingchen Yang,Qingyang Tan,Amin Jourabloo,Jason Saragih,Ke Li,Christian Häne

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词：Pose-driven full-body avatars, Pose-driven full-body, full-body avatars built, neural rendering produce, rendering produce high-quality

备注： Supplementary video: [this https URL](https://youtu.be/xjnr3YM0yIE)

点击查看摘要

Abstract:Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.

6. 【2605.21472】Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

链接：https://arxiv.org/abs/2605.21472

作者：Kaichen Zhou,Zeyang Bai,Xinhai Chang,Mengyu Wang,Paul Liang,Fangneng Zhan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：produce high-quality object, high-quality object reconstructions, real-world visual observation, produce high-quality, single view

备注： Multi-view 3D Generation, Streaming 3D Generation

点击查看摘要

Abstract:View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: this https URL.

7. 【2605.21466】StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

链接：https://arxiv.org/abs/2605.21466

作者：Guanlong Jiao,Chenyangguang Zhang,Jia Jun Cheng Xian,Zewei Zhang,Renjie Liao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：satisfying editing results, generally feasible, video editing, require many costly, costly iterations

备注： Project Page: [this https URL](https://dsl-lab.github.io/StreamGVE/)

点击查看摘要

Abstract:Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

8. 【2605.21454】ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

链接：https://arxiv.org/abs/2605.21454

作者：Amaya Gallagher-Syed,Costantino Pitzalis,Myles J. Lewis,Michael R. Barnes,Gregory Slabaugh

类目：Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)

关键词：biologically grounded representations, encoders producing biologically, producing biologically grounded, introduce ProtoPathway, imaging and transcriptomics

备注： Currently under peer review

点击查看摘要

Abstract:We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: this https URL.

9. 【2605.21443】mpGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

链接：https://arxiv.org/abs/2605.21443

作者：Yakun Yu,Ashley Wiens,Adrián Barahona-Ríos,Benedict Wilkins,Saman Zadtootaghaj,Nabajeet Barman,Cor-Paul Bezemer

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：game quality assurance, Vision-language models, quality assurance, video game quality, increasingly being explored

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

10. 【2605.21440】ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes

链接：https://arxiv.org/abs/2605.21440

作者：Zhiming Liu,Zhicheng Zou,Nantheera Anantrasirichai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：posing significant challenges, Atmospheric turbulence severely, severely degrades video, turbulence severely degrades, posing significant

备注：

点击查看摘要

Abstract:Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer, 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose ReMATF, a lightweight recurrent framework that restores videos using only two frames at a time while preserving spatial detail and temporal stability. ReMATF combines a multi-scale encoder-decoder with temporal warping and a motion-adaptive temporal fusion module that performs per-pixel fusion between the warped previous output and the current prediction to enhance coherence without enlarging the temporal window. This design reduces flicker, sharpens details, and remains efficient. Experiments on synthetic and real turbulence datasets show consistent improvements in PSNR/SSIM and perceptual quality (LPIPS), along with substantially faster inference than multi-frame transformer baselines, making ReMATF suitable turbulence mitigation in resource-constrained scenarios.

11. 【2605.21431】ryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

链接：https://arxiv.org/abs/2605.21431

作者：Jun Zheng,Zhengze Xu,Mengting Chen,Jing Wang,Jinsong Lan,Xiaoyong Zhu,Kaifu Zhang,Bo Zheng,Xiaodan Liang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Video Virtual Try-On, aims to seamlessly, Virtual Try-On, seamlessly replace, Interactive Video Virtual

备注： Project Page: [this https URL](https://zhengjun-ai.github.io/itryon-page) . Accepted by ICML 2026

点击查看摘要

Abstract:Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

12. 【2605.21421】AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

链接：https://arxiv.org/abs/2605.21421

作者：Lauhitya Reddy,Trisha M. Kesar,Hyeokhyen Kwon

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：measuring human movement, technical complexity, gold standard, standard for measuring, measuring human

备注： 18 pages 3 figures, 2 tables

点击查看摘要

Abstract:Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.

13. 【2605.21418】FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

链接：https://arxiv.org/abs/2605.21418

作者：Amin Farajzadeh,Melike Erol-Kantarci

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)

关键词：aggressive frequency reuse, making multi-cell orthogonal, frequency-division multiple access, frequency reuse amplifies, reuse amplifies inter-cell

备注： Submitted to IEEE for possible publication

点击查看摘要

Abstract:In sixth-generation (6G) ultra-dense networks, aggressive frequency reuse amplifies inter-cell interference (ICI), making multi-cell orthogonal frequency-division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management -- joint subcarrier scheduling and power allocation -- under interference coupling and long-term per-user quality-of-service (QoS) minimum-rate constraints. By using virtual-queue deficit weights to enforce long-term QoS, we develop FedCritic, a serverless federated multi-agent actor-critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip-based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference-rich reuse-1 setting show that FedCritic improves mean signal-to-interference-plus-noise ratio (SINR) and cell-edge rate, increases network-wide average sum-rate and fairness relative to non-coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.

14. 【2605.21417】Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

链接：https://arxiv.org/abs/2605.21417

作者：Junghyun Lee,Hyunseo Kim,Hanna Jang,Junhyug Noh

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：single dominant signal, overlapping multimodal cues, dominant signal, expressed as mixtures, mixtures of subtle

备注： Accepted at IEEE FG 2026. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

点击查看摘要

Abstract:Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.

15. 【2605.21414】PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

链接：https://arxiv.org/abs/2605.21414

作者：Shizhe Chen,Paul Pacaud,Cordelia Schmid

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：shown strong potential, general-purpose robotic manipulation, leveraging large pretrained, models have shown, shown strong

备注： Accepted to RSS 2026; project webpage: [this https URL](https://cshizhe.github.io/projects/pointact.html)

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

16. 【2605.21411】RoadTones: Tone Controllable Text Generation from Road Event Videos

链接：https://arxiv.org/abs/2605.21411

作者：Chirag Parikh,Siddhi Pravin Lipare,Ravi Kiran Sarvadevabhatla

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Existing video-language models, Existing video-language, Existing, events are expressed, road events

备注： Accepted at CVPR Findings 2026. Project page: [this https URL](https://roadtones.github.io/)

点击查看摘要

Abstract:Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

17. 【2605.21381】Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

链接：https://arxiv.org/abs/2605.21381

作者：Yi Liu,Jia Ma,Wengen Li,Jihong Guan,Shuigeng Zhou,Yichao Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Image Restoration, Flow Matching, synthesizing realistic textures, Recent advances, advances in Image

备注： 44 pages, 16 figures, 16 tables

点击查看摘要

Abstract:Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

18. 【2605.21372】Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

链接：https://arxiv.org/abs/2605.21372

作者：Hongzhi Ruan,Pei Liu,Weiliang Ma,Zhengning Li,Xueyang Zhang,Jun Ma,Dan Xu,Kun Zhan

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：modern deep learning, grows increasingly critical, deep learning, autonomous driving shifts, Data

备注：

点击查看摘要

Abstract:Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

19. 【2605.21371】A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

链接：https://arxiv.org/abs/2605.21371

作者：Leyue Tang,Jonathan Louis Bamber,Gang Qiao,Yuanhang Kong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Acquiring usable optical, frequent cloud cover, inherently challenging due, prolonged polar nights, Acquiring usable

备注： Submitted to IEEE JSTARS

点击查看摘要

Abstract:Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

20. 【2605.21343】OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

链接：https://arxiv.org/abs/2605.21343

作者：Ziye Li,Henghui Ding

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieved remarkable progress, achieved remarkable, remarkable progress, Recent, spatial controllability

备注： ICML 2026, Project Page: [this https URL](https://henghuiding.com/OcclusionFormer/)

点击查看摘要

Abstract:Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.

21. 【2605.21309】Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

链接：https://arxiv.org/abs/2605.21309

作者：Abhishek Dinkar Jagtap,Sanath Tiptur Sadashivaiah,Andreas Festag

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：communication enhances autonomous, shared sensory data, enhances autonomous driving, autonomous driving safety, unified environmental representation

备注： Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026

点击查看摘要

Abstract:Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computation overhead. Our approach is architecture-agnostic, and can be seamlessly integrating with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: this https URL

22. 【2605.21308】Deformba: Vision State Space Model with Adaptive State Fusion

链接：https://arxiv.org/abs/2605.21308

作者：Hongyu Ke,Jack Morris,Yongkang Liu,Satoshi Kitai,Kentaro Oguchi,Yi Ding,Haoxin Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：State Space Models, State Space, Space Models, alternative to Transformers, demonstrating linear-time complexity

备注：

点击查看摘要

Abstract:State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

23. 【2605.21301】Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls

链接：https://arxiv.org/abs/2605.21301

作者：Robin Louiset,Edouard Duchesnay,Benoit Dufumier,Antoine Grigis,Pietro Gori

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：biomedical Subgroup Discovery, Contrastive Subgroup Discovery, Subgroup Discovery, Subgroup Discovery method, Deep UCSL

备注： Accepted to Data Mining and Knowledge Discovery, ECML-PKDD 2026 Journal Track

点击查看摘要

Abstract:In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on a MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: this https URL.

24. 【2605.21300】Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

链接：https://arxiv.org/abs/2605.21300

作者：Meng Shen,Minghao Wu,Deepu Rajan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：large vision-language models, Object hallucination, significant challenge, challenge that hinders, hinders the application

备注： 20 pages, 10 figures, 10 tables

点击查看摘要

Abstract:Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.

25. 【2605.21280】Let EEG Models Learn EEG

链接：https://arxiv.org/abs/2605.21280

作者：Yifan Wang,Yijia Ma,Wen Li,Chenyu You

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：High-fidelity EEG generation, High-fidelity EEG, EEG generation, alleviating data scarcity, addressing privacy constraints

备注： Accepted by ICML 2026

点击查看摘要

Abstract:High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: this https URL .

26. 【2605.21273】DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

链接：https://arxiv.org/abs/2605.21273

作者：Weicheng Zheng,Yixin Huang,Qiao Sun,Derun Li,Hang zhao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：obtaining high-quality reasoning, high-quality reasoning annotations, long reasoning chains, understanding long reasoning, reasoning-centric interfaces face

备注：

点击查看摘要

Abstract:Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy, and being automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory--meta-action consistency. Experiments show that DriveMA already achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.

27. 【2605.21272】MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

链接：https://arxiv.org/abs/2605.21272

作者：Benjamin Aubin,Gonzalo Iñaki Quintana,Onur Tasar,Sanjeev Sreetharan,Urszula Czerwinska,Damien Henry,Clément Chadebec

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：models requires high-quality, Training large, requires high-quality, detailed captions, diverse content

备注：

点击查看摘要

Abstract:Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

28. 【2605.21268】Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

链接：https://arxiv.org/abs/2605.21268

作者：Arun D. Kulkarni

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Convolutional Neural Networks, sensing imagery plays, Vision Transformers, remote sensing land, sustainable resource management

备注： 12 pages

点击查看摘要

Abstract:Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architecture for remote sensing land use scene classification. Representative CNN models, such as AlexNet, is evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.

29. 【2605.21261】STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

链接：https://arxiv.org/abs/2605.21261

作者：Miaoge Li,Dongsheng Wang,Zening Sun,Jinsen Zhang,Wenhan Luo,Jingcai Guo

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：recently gaining increasing, gaining increasing research, increasing research interest, research interest due, unseen multimodal retrieval

备注：

点击查看摘要

Abstract:Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.

30. 【2605.21244】SR-Ground: Image Quality Grounding for Super-Resolved Content

链接：https://arxiv.org/abs/2605.21244

作者：Artem Borisov,Evgeney Bogatyrev,Khaled Abud,Dmitriy Vatolin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieving unprecedented fidelity, diffusion-based models achieving, models achieving unprecedented, Image Quality Assessment, recent years

备注：

点击查看摘要

Abstract:Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2605.21244 [cs.CV]

(or
arXiv:2605.21244v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.21244

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

31. 【2605.21237】RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

链接：https://arxiv.org/abs/2605.21237

作者：Xuan Yang,Xiaohan Yuan,Hao Li,Lingyu Chen,Yanan Liu,Lei Li

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：quantifying regional function, cycle is crucial, crucial for quantifying, strongly affected, Cardiac Motion Synthesis

备注： Early Accepted by MICCAI 2026. This is the author's submitted version. 10 pages, 3 figures

点击查看摘要

Abstract:Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single frame Bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex wise motion descriptors and clustering yields a data driven functional partition, providing an explicit motion derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region specific dynamics.

32. 【2605.21207】PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

链接：https://arxiv.org/abs/2605.21207

作者：Xiaoyu Zhou,Jianwei Fei,Peipeng Yu,Jingchang Xie,Chong Cheng,Zhihua Xia

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：modern diffusion models, rapid evolution, evolution of generative, GANs to modern, modern diffusion

备注：

点击查看摘要

Abstract:The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at this https URL.

33. 【2605.21195】RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

链接：https://arxiv.org/abs/2605.21195

作者：Siyong Jian,Siyuan Li,Luyuan Zhang,Zedong Wang,Xin Jin,Ying Li,Cheng Tan,Huan Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：current post-training pipelines, post-training pipelines optimize, Latent Covariate Shift, pipelines optimize, Discrete autoregressive

备注：

点击查看摘要

Abstract:Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity--alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.

34. 【2605.21190】Semantic Granularity Navigation in Image Editing

链接：https://arxiv.org/abs/2605.21190

作者：Liangsi Lu,Minzhe Guo,Xuhang Chen,Yang Shi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：real-image editing remains, editing remains constrained, real-image editing, structural fidelity, generative capabilities

备注： Accepted by ICML 2026

点击查看摘要

Abstract:Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

35. 【2605.21186】SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection

链接：https://arxiv.org/abs/2605.21186

作者：Wanying Tan,Shuo Yan,Dazhi Huang,Yazheng Liu,Zili Shao,Rufeng Chen,Hechang Chen,Mude Shi,Tianxing Ji,Sihong Xie

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：clinical auxiliary diagnosis, crucial confidence support, auxiliary diagnosis, crucial confidence, confidence support

备注： 10 pages, 4 figures, conference paper

点击查看摘要

Abstract:Interpretability in object detection provides crucial confidence support for clinical auxiliary diagnosis. However, in tiny bacteria detection, traditional explanation methods often suffer from blurred foreground boundaries and diffuse feature attribution due to the extreme sparsity of target morphological features and severe interference from complex backgrounds. Such limitations hinder the provision of logically coherent morphological evidence. To bridge this gap, we propose a novel eXplainable AI (XAI) framework, SAM-Sode. The framework innovatively transforms initial feature attribution maps into geometry-aware prompts, leveraging the prior knowledge of the foundation model (SAM3) to achieve spatial refinement and morphological reconstruction of the explanatory mappings. Furthermore, we introduce a dual-constraint mechanism based on physical significance and geometric alignment to perform instance-level denoising, generating coherent explanations that better align with human expert intuition. Experimental results on our self-constructed bacteria dataset with complex circuit backgrounds (containing 2,524 images) and other public datasets demonstrate that the proposed method effectively suppresses background redundancy and significantly enhances the decision-making transparency of tiny object detection.

36. 【2605.21182】Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

链接：https://arxiv.org/abs/2605.21182

作者：Jeonghun Baek,Atsuyuki Miyai,Shota Onohara,Hikaru Ikuta,Kiyoharu Aizawa

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Japanese popular culture, forms of Japanese, Japanese popular, culturally distinctive multimodal, distinctive multimodal medium

备注： Accepted to the Culture x AI Workshop at ICML 2026. Project page: [this https URL](https://manga109.github.io/manga109-project-website/en/)

点击查看摘要

37. 【2605.21171】FTerViT: Fully Ternary Vision Transformer

链接：https://arxiv.org/abs/2605.21171

作者：Szymon Ruciński,Pietro Bonazzi,Engin Türetken,Simon Narduzzi,Michele Magno,Nadim Maamari

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Transformers offer substantial, Vision Transformers offer, offer substantial model, Ternary Vision Transformers, encoder layers

备注： Preprint

点击查看摘要

Abstract:Ternary Vision Transformers offer substantial model compression, however state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer in which \emph{all} weight matrices and normalization parameters are ternarized (FTerViT). To this end, we introduce two novel operators : TernaryBitConv2d with per-channel scaling for patch embedding and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384$\times$384 resolution achieves 82.43\% ImageNet-1K top-1 at 6.09\,MB (${\sim}$15$\times$ compression, $-$2.42\,pp vs.\ FP32), outperforming prior ternary ViTs methods up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual cores XTensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224$\times$224 resolution, 5.81\,MB), we achieve 79.64\% ImageNet-1K top-1 accuracy.

38. 【2605.21157】Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

链接：https://arxiv.org/abs/2605.21157

作者：Sourov Roy Shuvo,Prajwal Panth,Rajesh Chowdhury,Sorup Chakraborty,Sudip Chakrabarty,Prasant Kumar Pattnaik

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：modern warfare, essential part, part of intelligence, intelligence gathering, gathering and carrying

备注： 6 pages, 7 figures. Accepted at the 16th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 6-11, 2025, IIT Indore. Proceedings pending publication

点击查看摘要

Abstract:In modern warfare, drones are becoming an essential part of intelligence gathering and carrying out precise attacks in different kinds of hostile environments. Their ability to operate in real-time and hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset is comprised of images of different military scenarios taken from drones, and these provide a foundation for detecting military objects, but it does not take into account the various types of real-world scenarios. With that in mind, to evaluate how the models are performing under varying conditions, four different types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate the real-world environments such as low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and used to detect objects across diverse settings. This research boosts the performance and reliability of drone-based operations by contributing to the development of advanced detection systems in both defensive and offensive missions.

39. 【2605.21139】Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

链接：https://arxiv.org/abs/2605.21139

作者：Yang Wu,Qiang Meng,Zhaojiang Liu,Youquan Liu,Jian Yang,Jin Xie

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：behavioral cloning ceiling, fundamentally constrained, behavioral cloning, cloning ceiling, ceiling of imitation

备注：

点击查看摘要

Abstract:Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a CognitivePhysical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

40. 【2605.21132】SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

链接：https://arxiv.org/abs/2605.21132

作者：Jingyi He,Yue Zhou,Long Bai,Kun Yuan,Nassir Navab,Yuan Bi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：systems continuously perceive, intelligent surgical embodiment, surgery proceeds, real time, time is fundamental

备注：

点击查看摘要

Abstract:Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

41. 【2605.21131】UniT: Unified Geometry Learning with Group Autoregressive Transformer

链接：https://arxiv.org/abs/2605.21131

作者：Haotian Wang,Yusong Huang,Zhaonian Kuang,Hongliang Lu,Xinhu Zheng,Meng Yang,Gang Hua

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent feed-forward models, Recent feed-forward, significantly advanced geometry, inferring dense, advanced geometry perception

备注： Submitted to IEEE T-PAMI

点击查看摘要

Abstract:Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.

42. 【2605.21130】VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

链接：https://arxiv.org/abs/2605.21130

作者：Shibei Meng,Binxin Yang,Yuan Liu,Jiexuan Zhang,Zhengyao Lv,Hubery Yin,Qiang Xu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Large Multimodal Models, Large Multimodal, Multimodal Models, shown promise, video quality assessment

备注：

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose \textbf{VersusQ}, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.

43. 【2605.21123】Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

链接：https://arxiv.org/abs/2605.21123

作者：Kesong Li,Yixuan Xu,Kuo-kun Tseng,Weiyi Lu,Kan Liu,Tao Lan

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Direct Preference Optimization, Direct Preference, Preference Optimization, DPO objective, generalized DPO objective

备注： Code and models are available at: [this https URL](https://github.com/Whynot0101/Linear-DPO) . Work done during an internship at Alibaba Group

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.

44. 【2605.21121】ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

链接：https://arxiv.org/abs/2605.21121

作者：Hanxiao Sun,Mingxin Yang,Shuhui Yang,Zebin He,Xintong Han,Hongbo Fu,Chunchao Guo,Wenhan Luo

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词：produce high-quality geometry, single view inevitably, unseen regions, produce high-quality, inevitably introduces ambiguity

备注：

点击查看摘要

Abstract:Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

45. 【2605.21112】RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding

链接：https://arxiv.org/abs/2605.21112

作者：Weiyi Xiong,Bing Zhu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：cloud sparsity challenges, autonomous driving due, point cloud sparsity, object detection, cost and robustness

备注：

点击查看摘要

Abstract:4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

46. 【2605.21099】R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound

链接：https://arxiv.org/abs/2605.21099

作者：Yuanhan Wang,Yifei Chen,Beining Wu,Mingxuan Liu,Xiaotian Hu,Chunbo Jiang,Yijin Li,Changmiao Wang,Feiwei Qin,Qiyuan Tian

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：intrapartum transperineal ultrasound, remains highly sensitive, labor progression, Angle of Progression, Accurate estimation

备注： 11pages,4 figures,Accepted by MICCAI 2026

点击查看摘要

Abstract:Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at this https URL.

47. 【2605.21090】xtSculptor: Training and Benchmarking Scene Text Editing

链接：https://arxiv.org/abs/2605.21090

作者：Yiheng Lin,Siyu Jiao,Xiaohan Lan,Wei Zhou,Qi She,Fei Yu,Heyun Chen,Zhengwei Wang,Jinghuan Chen,Moran Li,Yingchen Yu,Zijian Feng,Yao Zhao,Yunchao Wei,Yujie Zhong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Large Language Models, Multimodal Large Language, Large Language, substantially improved prompt-driven, Recent advances

备注：

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at this https URL.

48. 【2605.21079】VDFP: Video Deflickering with Flicker-banding Priors

链接：https://arxiv.org/abs/2605.21079

作者：Zhiyi Zhou,Libo Zhu,Zihan Zhou,Yulun Zhang,Xiaokang Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Capturing digital screens, hardware synchronization mismatches, smartphones frequently induces, frequently induces severe, Capturing digital

备注： [this https URL](https://github.com/ZhiyiZZhou/VDFP)

点击查看摘要

Abstract:Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We firstly construct DeViD, a real-world dataset in various scenes to deal with the lack of available this http URL we propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce a Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM) capable of synthesizing complex multi-banding scenarios. Second, we present a spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding with high-fidelity spatial details and temporal consistency. Our dataset and code will be released at~ this https URL.

49. 【2605.21075】SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

链接：https://arxiv.org/abs/2605.21075

作者：Nassim Ait Ali Braham,Aaron Banze,Conrad M. Albrecht,Julien Mairal,Jocelyn Chanussot,Xiao Xiang Zhu

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：synthetic aperture radar, derived geospatial layers, spanning multispectral imagery, Earth observation, foundation models

备注：

点击查看摘要

Abstract:Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

50. 【2605.21072】Q-ARVD: Quantizing Autoregressive Video Diffusion Models

链接：https://arxiv.org/abs/2605.21072

作者：Siao Tang,Xinyin Ma,Gongfan Fang,Xingyi Yang,Xinchao Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：real-time interactive video, streaming video generation, interactive video generation, video diffusion models, Autoregressive video diffusion

备注： Code: [this https URL](https://github.com/tsa18/Q-ARVD)

点击查看摘要

Abstract:Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.

51. 【2605.21061】Grounding Driving VLA via Inverse Kinematics

链接：https://arxiv.org/abs/2605.21061

作者：Junsung Park,Hyunjung Shim

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词：ill-posed task formulation, structurally ill-posed task, Existing Driving VLAs, existing VLAs supply, VLAs predict trajectories

备注：

点击查看摘要

Abstract:Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

52. 【2605.21059】Multimodal LLMs under Pairwise Modalities

链接：https://arxiv.org/abs/2605.21059

作者：Yan Li,Yunlong Deng,Yuewen Sun,Gongxu Luo,Kun Zhang,Guangyi Chen

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：requiring substantial human, impressive results achieved, substantial human effort, multimodal large language, jointly curated multimodal

备注：

点击查看摘要

Abstract:Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

53. 【2605.21042】Dynamic Video Generation: Shaping Video Generation Across Time and Space

链接：https://arxiv.org/abs/2605.21042

作者：Shikang Zheng,Jingkai Huang,Jiacheng Liu,Guantao Chen,Lixuan,Yuqi Lin,Peiliang Cai,Linfeng Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieved impressive performance, iterative denoising process, computationally expensive due, denoising process remains, process remains computationally

备注：

点击查看摘要

Abstract:Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.

54. 【2605.21032】owards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation

链接：https://arxiv.org/abs/2605.21032

作者：Bowyn Tan,Yutong Xie,Bai Huang,Fan Luo,Xiao Li,Naizheng Wang,Yang Guan,Shengbo Eben Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：High-fidelity street scene, High-fidelity street, autonomous driving simulation, street scene reconstruction, autonomous driving

备注： 20 pages, 4 figures

点击查看摘要

Abstract:High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities to facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to simultaneously achieve both. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments demonstrate that our method not only maintains stable NVS capabilities but also demonstrates superior performance in traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.

55. 【2605.21028】DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

链接：https://arxiv.org/abs/2605.21028

作者：Bo Ye,Xinyu Cui,Jian Zhao,Tong Wei,Min-Ling Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：typically combining local, Autoregressive long video, adopts bounded-memory streaming, combining local windows, Autoregressive long

备注：

点击查看摘要

Abstract:Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at this https URL.

56. 【2605.21007】LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

链接：https://arxiv.org/abs/2605.21007

作者：Daojie Peng,Bingtao Wang,Fulong Ma,Liang Zhang,Jun Ma

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：resource-constrained edge devices, fundamental perception task, Road segmentation, multi-modal road segmentation, requiring both high

备注：

点击查看摘要

Abstract:Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose \textbf{LiteViLNet}, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36\% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.

57. 【2605.21002】Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts

链接：https://arxiv.org/abs/2605.21002

作者：Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov,Nurana Abdullayeva

类目：Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Multimedia (cs.MM)

关键词：synthesizes photorealistic imagery, traditional forensic intuition, defeats traditional forensic, Generative artificial intelligence, Generative artificial

备注： 13 pages, 4 figures, 10 tables. Submitted to IEEE Transactions on Information Forensics and Security

点击查看摘要

Abstract:Generative artificial intelligence now synthesizes photorealistic imagery, audio, and video at a cost that defeats traditional forensic intuition. The legal consequences span three regimes studied so far in isolation: international operational law, domestic procedure, and product regulation. This article presents a unified evidentiary framework that maps cryptographic content provenance, robust statistical watermarking, and zero knowledge attestation to the proof requirements of each regime. We define a five tier threat model spanning naive regeneration, adversarial laundering, cross model regeneration, active watermark removal, and insider provenance forgery. We release a public benchmark of 12000 generated items across image, audio, and video modalities under six laundering pipelines for 72000 evaluation samples. We evaluate four representative schemes and report true positive rate at fixed false positive rate, robustness area under the curve, computational overhead, and a regime conditioned legal sufficiency score. We translate empirical detection bounds into legal sufficiency thresholds for command decisions under the law of armed conflict, for criminal and civil admissibility under domestic procedure, and for persistence audits under the European Union Artificial Intelligence Act and analogous regimes. The result is a reproducible reference pipeline, a public benchmark, and model annexes that lawyers, engineers, and operators can deploy together.

58. 【2605.21001】DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars

链接：https://arxiv.org/abs/2605.21001

作者：Daniel Eskandar,Berna Kabadayi,Garvita Tiwari,Gerard Pons-Moll

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：high visual fidelity, ignore geometric structure, avatar reconstruction method, Controllable Multi-layered Avatars, achieve high visual

备注：

点击查看摘要

Abstract:Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: this https URL

59. 【2605.20997】Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

链接：https://arxiv.org/abs/2605.20997

作者：Islam Mansour,Ronny Haensch,Irena Hajnsek,Konstantinos Papathanassiou

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

关键词：Integrating machine learning, retrieving geophysical parameters, remote sensing data, Integrating machine, retrieving geophysical

备注：

点击查看摘要

Abstract:Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, a ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed, that constrains the learning process through a PM. While the features used for training and inversion where selected to ensure the physical consistency of the solutions, they could not resolve all height / structure and baseline / terrain slope ambiguities in the data. To improve this, the extension of the feature space with optical Landsat data is proposed able to provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Gabonese Lopé national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.

60. 【2605.20992】CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

链接：https://arxiv.org/abs/2605.20992

作者：Hao Xu,Yilin Liu,Yinqiao Wang,Chi-Wing Fu,Niloy J. Mitra

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：articulated hand motion, pose over time, turned into reusable, everyday open-world monocular, Contact-aware HOI Reconstruction

备注：

点击查看摘要

Abstract:We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module to predict ray-depth corrections and rectify hand-object relative placement, then derive initial per-frame contact correspondences on the rectified geometry. Last, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

61. 【2605.20973】owards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

链接：https://arxiv.org/abs/2605.20973

作者：Dibyayan Patra,Simit Raval,Pasindu Ranasinghe,Bikram Banerjee,Ismet Canbulat

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：surrounding rock mass, rock bolt identification, installed rock bolts, rock bolt, rock

备注：

点击查看摘要

Abstract:The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.

62. 【2605.20971】Comparative Evaluation of Deep Learning Models for Fake Image Detection

链接：https://arxiv.org/abs/2605.20971

作者：Akhitha Pakala,Mohammed Mahir Rahman,Shahzad Memon,Tauseef Ahmed

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词：manipulation presents significant, presents significant challenges, GAN-based image manipulation, image manipulation presents, digital forensics

备注： Accepted at ICCIIoT26 and waiting to be indexed

点击查看摘要

Abstract:The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.

63. 【2605.20965】Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

链接：https://arxiv.org/abs/2605.20965

作者：Yutong Xie,Zhenglin Hua,Ran Wang,Wing W. Y. Ng,Xizhao Wang,Yuheng Jia

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Large Vision-Language Models, Large Vision-Language, shown remarkable performance, visual evidence, visual

备注： Accepted by ICML 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at this https URL.

64. 【2605.20963】owards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

链接：https://arxiv.org/abs/2605.20963

作者：Yihang Luo,Jun Chen,Chao Xiao,Yingqian Wang,Zhaoxu Li,Qiang Ling,Xu He,Nuo Chen,Gaowei Guo,Hongge Li,Miao Li,Longguang Wang,Yulan Guo,Li Liu,Wei An,Zhijie Chen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：unmanned aerial vehicles, created urgent demand, aerial vehicles, proliferation of unmanned, unmanned aerial

备注： submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

点击查看摘要

Abstract:The proliferation of unmanned aerial vehicles (UAVs) has created urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% = 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows MFDNet achieves +6.2% AP50 improvement over best RGB-only methods, demonstrating spectral cues provide complementary material evidence beyond spatial cues. This work provides foundational dataset, strong baseline, and benchmark for multispectral UAV monitoring research.

65. 【2605.20961】Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

链接：https://arxiv.org/abs/2605.20961

作者：Zhangchi Hu,Wenzhang Sun,Xiangchen Yin,Jiahui Yuan,Chunfeng Wang,Hao Li,Kun Zhan,Xiaoyan Sun

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：editing requires preserving, requires preserving source-observed, preserving source-observed regions, target plausible generation, models primarily target

备注： 23 pages, 13 figures

点击查看摘要

Abstract:Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: this https URL

66. 【2605.20955】DrawMotion: Generating 3D Human Motions by Freehand Drawing

链接：https://arxiv.org/abs/2605.20955

作者：Tao Wang,Lei Jin,Zhihua Wu,Qiaozhi He,Jiaming Chu,Yu Cheng,Junliang Xing,Jian Zhao,Shuicheng Yan,Li Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：translates textual descriptions, faces the challenge, descriptions into human, struggle to precisely, precisely convey

备注：

点击查看摘要

Abstract:Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at this https URL.

67. 【2605.20950】Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

链接：https://arxiv.org/abs/2605.20950

作者：Yulin Zhao,Yun Wang,Dehua Zheng,Borui jiang,Zheng Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：prohibitive computational costs, computational costs arising, visual token sequences, massive visual token, face a bottleneck

备注：

点击查看摘要

Abstract:Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the \textit{Focus-then-Context} mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.

68. 【2605.20942】Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

链接：https://arxiv.org/abs/2605.20942

作者：Lena Wild,Katie Z Luo,Marco Pavone

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：safe autonomous driving, traffic element relationships, lane geometry, autonomous driving, relationships is foundational

备注：

点击查看摘要

Abstract:Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

69. 【2605.20941】PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

链接：https://arxiv.org/abs/2605.20941

作者：Yunge Wen,Yuancheng Shen,Paul Pu Liang

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)

关键词：artistic behavior conditioned, evolving canvas states, neural painting assistant, open-ended autoregressive artistic, autoregressive artistic behavior

备注：

点击查看摘要

Abstract:We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)

Cite as:
arXiv:2605.20941 [cs.CV]

(or
arXiv:2605.20941v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.20941

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

70. 【2605.20940】3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

链接：https://arxiv.org/abs/2605.20940

作者：Olivia Zumsteg,Jannis Widmer,Yann Bourdé,Norbert Kirchgessner,Andreas Hund,Lukas Roth,Paraskevi Nousi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：stress resilience assessment, measurement remains challenging, yield component analysis, field-based measurement remains, resilience assessment

备注： 8 pages, 6 figures (Appendix: 4 pages, 5 figures)

点击查看摘要

Abstract:Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.

71. 【2605.20922】Winfree Oscillatory Neural Network

链接：https://arxiv.org/abs/2605.20922

作者：Jiawen Dai,Yue Song

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：widely believed, believed to play, play a fundamental, fundamental role, Oscillations and synchronization

备注： Project page: [this https URL](https://jiawen-dai.github.io/WONN_Project_Page/)

点击查看摘要

Abstract:Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.

72. 【2605.20914】RISE: Reliable Improvement in Self-Evolving Vision-Language Models

链接：https://arxiv.org/abs/2605.20914

作者：Chaoran Xu,Yingmao Miao,Pengfei Zhang,Hao Dou,Lei Sun,Xiangxiang Chu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：multimodal reasoning capabilities, achieved strong multimodal, large-scale human-constructed supervision, strong multimodal reasoning, reasoning capabilities

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose \textbf{RISE}, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at this https URL.

73. 【2605.20910】FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

链接：https://arxiv.org/abs/2605.20910

作者：Jangho Park,Geon Yeong Park,Gihyun Kwon,Jong Chul Ye

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：long sequences remains, video diffusion models, important challenge, sequences remains, remains a long-standing

备注： Project Page: [this https URL](https://flowlong-video.github.io/)

点击查看摘要

Abstract:Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via \emph{Tweedie matching} to enforce both \textbf{manifold constraint and temporal consistency} across overlap regions. \emph{Stochastic early-phase sampling} then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.

74. 【2605.20908】SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

链接：https://arxiv.org/abs/2605.20908

作者：Tores Julie,Sun Rémy,Sassatelli Lucile,Ancarani Elisa,Wu Hui-Yin,Precioso Frédéric

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：models provide interpretability, standard neural networks, offer strong task, offer strong, support test-time human

备注：

点击查看摘要

Abstract:Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the \emph{Synergy Concept-Based Model (SynCB)} framework, that combines a CB branch with a complementary neural branch, and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and CB branches through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.

75. 【2605.20904】JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

链接：https://arxiv.org/abs/2605.20904

作者：Qiaohui Chu,Haoyu Zhang,Yisen Feng,Meng Liu,Weili Guan,Dongmei Jiang,Liqiang Nie

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：JEPA-based Future Action, Action Anticipation method, Future Action Anticipation, Action Anticipation, JEPA-based Future

备注： The champion solution for the EPIC-KITCHENS-100 Action Anticipation Challenge at the CVPR EgoVis Workshop 2026

点击查看摘要

Abstract:We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at this https URL.

76. 【2605.20901】VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

链接：https://arxiv.org/abs/2605.20901

作者：Qiaohui Chu,Haoyu Zhang,Yisen Feng,Meng Liu,Weili Guan,Dongmei Jiang,Liqiang Nie

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：V-JEPA Integrated StillFast, Object Interaction Anticipation, Integrated StillFast Temporal, StillFast Temporal Anticipator, Short-Term Object Interaction

备注： The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

点击查看摘要

Abstract:We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at this https URL.

77. 【2605.20892】FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

链接：https://arxiv.org/abs/2605.20892

作者：Enhui Yu,Junhui Li,Ruitong Lu,Jialu Li,Youshan Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Fine-grained fruit classification, agricultural computer vision, high visual similarity, Fine-grained fruit, computer vision

备注： 10 pages,6 figures,submitted to CVPR 2026

点击查看摘要

Abstract:Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49\% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.

78. 【2605.20891】HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

链接：https://arxiv.org/abs/2605.20891

作者：Huayi Wang,Haochao Ying,Yuyang Xu,Qiyao Zheng,jun wang,Cheng Zhang,Ying Sun,Jian Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal survival prediction, multimodal medical data, Slide Images, Genomic Profiles, accurate prognostic modeling

备注： 12 pages, HDMoE has been accepted by KDD 2026 AI for Sciences Track

点击查看摘要

Abstract:Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (\eg Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect;(2) lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a \underline{H}ierarchical \underline{D}ecoupling-Fusion \underline{M}ixture-\underline{o}f-\underline{E}xperts (HDMoE) framework with two levels of MoE and \underline{R}andom \underline{F}eature \underline{R}eorganization (RFR) this http URL the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at this https URL.

79. 【2605.20889】Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video

链接：https://arxiv.org/abs/2605.20889

作者：Hiroyuki Deguchi,Ryosuke Hori,Kotaro Amaya,Tsubasa Maruyama,Mitsunori Tada,Hideo Saito

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：ubiquitous activity monitoring, essential for ubiquitous, ubiquitous activity, human pose estimation, absolute location

备注： Accepted at ICIP 2026, Project page: [this https URL](https://deguchihiroyuki.github.io/Map-Mono-Ego-Project/)

点击查看摘要

Abstract:Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position, and tend not to account for the wearer's absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework achieving globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.

80. 【2605.20867】ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

链接：https://arxiv.org/abs/2605.20867

作者：Yingjia Xu,Jiulong Wu,Bowen Zhang,Baokui Guo,Siyuan Chai,Min Cao

类目：Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal sarcasm detection, perspectives needed vary, specific analytical perspectives, sarcasm detection requires, intended meaning

备注：

点击查看摘要

Abstract:Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.

81. 【2605.20839】Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

链接：https://arxiv.org/abs/2605.20839

作者：Jeffrey Wang,Jonathan Gregory,Grigorios G. Chrysos

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：treat pointwise activations, Modern vision backbones, backbones treat pointwise, Modern vision, pointwise activations

备注： Accepted to ICML 2026

点击查看摘要

Abstract:Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.

82. 【2605.20838】USV: Towards Understanding the User-generated Short-form Videos

链接：https://arxiv.org/abs/2605.20838

作者：Haoyue Cheng,Su Xu,Liwei Jin,Wayne Wu,Chen Qian,Limin Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：user-generated short-form videos, User-generated Short-form Video, user-generated short-form, large-scale video datasets, advanced the area

备注：

点击查看摘要

Abstract:Several large-scale video datasets have been published these years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has achieved plausible improvement these years, most works focus on instance-level recognition, which is not sufficient for learning the representation of the high-level semantic information of videos. Therefore, we further establish two tasks: topic recognition and video-text retrieval on USV. We propose two unified and effective baseline methods Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle the topic recognition task and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is this https URL.

83. 【2605.20837】ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

链接：https://arxiv.org/abs/2605.20837

作者：Qirui Shen,Wenda Wang,Jiachen Lu,Zilong Huang,Jin Bai,Lei He,Hongxuan Chen,Weixin Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Architectural spatial intelligence, infer architectural space, spatial intelligence, Architectural spatial, Architectural

备注： 51 pages

点击查看摘要

Abstract:Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at this https URL.

84. 【2605.20827】HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction

链接：https://arxiv.org/abs/2605.20827

作者：Yaoyao Yue,Jérôme Schmid,Xiaoshuang Li,Eduardo Delamare,Jinman Kim

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：routine dental care, Panoramic radiograph, craniofacial anatomy, dental care, projection of complex

备注： 10 pages

点击查看摘要

Abstract:Panoramic radiograph (PR) is fundamentally used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods ($p 0.05$), achieving a 25.76 dB PSNR, 85.70\% SSIM, and an 83.83\% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4\% Dice) and the inferior alveolar canal (72.2\% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.

85. 【2605.20823】RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

链接：https://arxiv.org/abs/2605.20823

作者：Minh Anh Nguyen,Quang Huy Tran,Bao Ngoc Le,Tuan Kiet Pham,Sui Yang Guang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：flexible natural-language predicates, scene graph generation, graph generation seeks, describe object instances, scene graph

备注：

点击查看摘要

Abstract:Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission

86. 【2605.20822】ERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection

链接：https://arxiv.org/abs/2605.20822

作者：Jiae Yoon,Ue-Hwan Kim

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Scene Change Detection, challenge of Scene, Change Detection, Scene Change, Existing SCD models

备注： 8 pages, 4 figures. Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

点击查看摘要

Abstract:In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at this https URL.

87. 【2605.20821】VSCD: Video-based Scene Change Detection in Unaligned Scenes

链接：https://arxiv.org/abs/2605.20821

作者：Jiae Yoon,Ue-Hwan Kim

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：assume fixed viewpoints, Scene Change Detection, detection settings assume, settings assume fixed, mild misalignment

备注： 18 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.

88. 【2605.20820】AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting

链接：https://arxiv.org/abs/2605.20820

作者：Zhaojie Zeng,Yuesong Wang,Yawei Luo,Tao Guan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：require costly per-image, efficient explicit representation, costly per-image iterative, representation for image, existing methods

备注： preprint version

点击查看摘要

Abstract:2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict--Optimize--Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160--300\,ms. Code: this https URL

89. 【2605.20818】OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

链接：https://arxiv.org/abs/2605.20818

作者：Yisen Feng,Leigang Qu,Haoyu Zhang,Qiaohui Chu,Meng Liu,Xuemeng Song,Weili Guan,Liqiang Nie

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Episodic Memory Challenge, Episodic Memory, Challenge at CVPR, Memory Challenge, Natural Language Queries

备注： Champion solution for the Natural Language Queries and GoalStep tracks of the Ego4D Challenge at the CVPR EgoVis Workshop 2026

点击查看摘要

Abstract:In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments from long untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that effectively leverages the strong video-language reasoning capability of multimodal large language model (MLLM) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from existing localization model OSGNet, and then employ MLLM to select the segment that best matches the given query, thereby refining the final prediction. Ultimately, our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at this https URL.

90. 【2605.20808】Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

链接：https://arxiv.org/abs/2605.20808

作者：Jinjin Zhang,Xiefan Guo,Di Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：image synthesis relies, large-scale pre-trained Latent, synthesis relies heavily, Latent Diffusion Models, robust generative capacity

备注： Technical Report

点击查看摘要

Abstract:Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at this https URL.

91. 【2605.20807】Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

链接：https://arxiv.org/abs/2605.20807

作者：Hanzhong Guo,Yizhou Yu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：preserve high-frequency identity, high-frequency identity details, struggles to preserve, preserve high-frequency, high-frequency identity

备注：

点击查看摘要

Abstract:Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.

92. 【2605.20804】OlmoEarth v1.1: A more efficient family of OlmoEarth models

链接：https://arxiv.org/abs/2605.20804

作者：Gabriel Tseng,Yawen Zhang,Favyen Bastani,Henry Herzog,Joseph Redmon,Hadrien Sablon,Piper Wolters,Patrick Alan Johnson,Christopher Wilhelm,Patrick Beukema

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：OlmoEarth family, present a set, reduction in GPU, times, Abstract

备注：

点击查看摘要

Abstract:We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training ($1.7 \times$ reduction in GPU hours required to train our Base models) and inference ($2.9\times$ reductions in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at this http URL.

93. 【2605.20795】What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

链接：https://arxiv.org/abs/2605.20795

作者：Hangyu Lin,Chao Wen,Chengming Xu,Jianxiong Gao,Jiangning Zhang,Xiaobin Hu,Yanwei Fu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Flow matching based, Flow matching, prepended Vision-Language Models, handle complex, increasingly relying

备注：

点击查看摘要

Abstract:Flow matching based video generative models have been increasingly relying on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To rigorously investigate this, we propose a controlled data processing pipeline based on video composition that results in TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic protocol to analyze two important designs of meta-query and connector in the existing video editing models. Systematic evaluation of four representative model cases reveals that fine-grained structural semantics can be severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identifying the VLM-to-DiT alignment as a major bottleneck and providing a new diagnostic foundation for future multi-modal alignment architectures.

94. 【2605.20787】Findings of the Counter Turing Test: AI-Generated Image Detection

链接：https://arxiv.org/abs/2605.20787

作者：Rajarshi Roy,Nasrin Imanpour,Ashhar Aziz,Shashwat Bajpai,Gurpreet Singh,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Amitava Das,Amit Sheth,Vasu Sharma,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Stable Diffusion, synthetic visual content, rapid advancements, transformed the creation, Counter Turing Test

备注： Defactify4 @AAAI 2025

点击查看摘要

Abstract:The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To facilitate this, we developed the MS COCOAI dataset, consisting of 50,000 synthetic images from multiple generative models alongside real-world images from the MS COCO dataset. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.

Comments:
Defactify4 @AAAI 2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2605.20787 [cs.CV]

(or
arXiv:2605.20787v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.20787

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

95. 【2605.20780】Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

链接：https://arxiv.org/abs/2605.20780

作者：Haozhe Jia,Pengyu Yin,Wenshuo Chen,Shaofeng Liang,Lei Wang,Bowen Tian,Xiucheng Wang,Nanqian Jia,Yutao Yue

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Physics-informed diffusion models, shifted boundary conditions, models typically enforce, enforce PDE constraints, typically enforce PDE

备注：

点击查看摘要

Abstract:Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce **REPA-P**, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing **zero overhead**. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [this https URL](this https URL).

96. 【2605.20777】AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

链接：https://arxiv.org/abs/2605.20777

作者：Manogna Sreenivas,Rohit Kumar,Soma Biswas

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：made impressive strides, made impressive, impressive strides, strides in maintaining, Visual storytelling

备注： Accepted at CVPR AIStory Workshop, 2026

点击查看摘要

Abstract:Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic method to ensure if fine-grained attributes such as color and textures of clothing, accessories are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through AttriLoss objective designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements on incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project-page:this https URL

97. 【2605.20772】VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering

链接：https://arxiv.org/abs/2605.20772

作者：Jiayi Chen,Benteng Ma,Zehui Liao,Winston Chong,Yasmeen George,Jianfei Cai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal Large Language, Large Language Models, medical Multimodal Large, Multimodal Large, Large Language

备注： Early accepted by MICCAI 2026

点击查看摘要

Abstract:While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at this https URL

98. 【2605.20766】Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

链接：https://arxiv.org/abs/2605.20766

作者：Zhu Liu,Yuanhang Yao,Ping Qian,Zihang Chen,Risheng Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：low-contrast infrared imagery, infrared small target, severe sample-distribution imbalance, unstable pseudo-label evolution, small target detection

备注：

点击查看摘要

Abstract:Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at this https URL.

99. 【2605.20760】SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation

链接：https://arxiv.org/abs/2605.20760

作者：K S Nithurshen,Saurabh J. Shigwan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Computed Tomography, Automated segmentation, column in Computed, surgical planning, vertebral column

备注： 2 Figures, 3 Tables

点击查看摘要

Abstract:Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier for clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves a Dice score of 88.17% and 88.13% respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach successfully maintains high accuracy. Crucially, heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.

100. 【2605.20758】Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

链接：https://arxiv.org/abs/2605.20758

作者：Xuehui Yu,Fucheng Cai,Meiyi Wang,Xiaopeng Fan,Harold Soh

类目：Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：Inference-time guided sampling, guided sampling steers, Inference-time guided, sampling steers, diffusion and flow

备注： Forty-Third International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at this https URL.

101. 【2605.20743】Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

链接：https://arxiv.org/abs/2605.20743

作者：Juncheng Hu,Jiawei Du,Xin Zhang,Joey Tianyi Zhou

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Vision-language models solve, solve geometry problems, drawing code carries, models solve geometry, intermediate states remain

备注：

点击查看摘要

102. 【2605.20738】STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

链接：https://arxiv.org/abs/2605.20738

作者：Yaoteng Zhang,Qing Zhou,Junyu Gao,Qi Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Remote Sensing Incremental, continuous data streams, imagery typically arrives, Remote sensing imagery, sensing imagery typically

备注： STAR-IOD was accepted by ISPRS Journal of Photogrammetry and Remote Sensing

点击查看摘要

Abstract:Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby guaranteeing an accurate distinction between true positive targets and background noise and alleviating the issue of missing annotations for old classes. We also constructed two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: this https URL.

103. 【2605.20737】Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

链接：https://arxiv.org/abs/2605.20737

作者：Siqi Wei,Hongbin Xu,Feng Xiao,Tian Lan,Chun Li,Ming Li,Qiuxia Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：segmentation predominantly rely, fundamental limitation, purely visual similarity-based, predominantly rely, visual similarity-based

备注： In submission. The code will be released at: [this https URL](https://github.com/Whisky0129/langtail_official)

点击查看摘要

Abstract:Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, \ie, +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: this https URL.

104. 【2605.20735】Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

链接：https://arxiv.org/abs/2605.20735

作者：Siamul Karim Khan,Patrick J. Flynn,Adam Czajka

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：providing both Python, open-source iris recognition, iris, iris recognition, official IREX

备注：

点击查看摘要

Abstract:This paper proposes two new open-source iris recognition algorithms, providing both Python and IREX-compliant C++ implementations to be submitted to the official IREX X program. This work has two primary goals: (a) to conduct the first-ever assessment of open-source iris recognition solutions according to IREX testing protocols, and (b) to offer a model C++ submission that significantly facilitates the entry of other teams' open-source methods into the IREX evaluation. The new methods consist of two Neural Networks trained with: (i) Triplet loss with Batch-Hard Triplet mining (TripletIris), and (ii) ArcFace loss (ArcIris). The paper also provides open-source IREX-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). Except for CRYPTS, which faced timing constraints during 1:N search, these methods have undergone the official IREX X evaluation and have also been assessed using several popular academic benchmarks: Quality-Face/Iris Research Ensemble, Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2. Finally, this paper also provides open-source models for iris segmentation and circle estimation that can be incorporated into any new iris recognition method.

105. 【2605.20733】Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

链接：https://arxiv.org/abs/2605.20733

作者：Wenda Wang,Anqi Liu,Junqi Yang,Lei He,Luying Wang,Jiachen Lu,Weixin Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：geometries remains challenging, remains challenging due, Converting hand-drawn sketches, representing non-Euclidean surfaces, maintaining topological consistency

备注： 22 pages, 16 figures, includes appendix

点击查看摘要

Abstract:Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at this https URL.

106. 【2605.20732】Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

链接：https://arxiv.org/abs/2605.20732

作者：Kin Whye Chew,Jingxian Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Convolutional Neural Networks, Convolutional Neural, Neural Networks, learning superficially predictive, causally irrelevant features

备注： Under review. 26 pages, 7 figures

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.

107. 【2605.20731】ASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

链接：https://arxiv.org/abs/2605.20731

作者：Haonan Zhu,Elad Hirsch,Alexandria Minetti,Allison Nulty,Purvanshi Mehta

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)

关键词：models produce graphic, production scale, verdict per comparison, produce graphic design, photo-style preference data

备注：

点击查看摘要

Abstract:Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal test framework, using Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design between food and movie preferences and photo-style image quality, with every TASTE criterion rejecting the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers, HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2, exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.

108. 【2605.20728】Early High-Frequency Injection for Geometry-Sensitive OOD Detection

链接：https://arxiv.org/abs/2605.20728

作者：Chuanjie Cheng,Ningkang Peng,Chenxi Liu,Yifan He,Peirong Ma,Yanhui Gu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Post-hoc OOD detectors, Post-hoc OOD, detectors score logits, OOD detectors score, representation method PALM

备注：

点击查看摘要

Abstract:Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at this https URL.

109. 【2605.20727】GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

链接：https://arxiv.org/abs/2605.20727

作者：Ningkang Peng,Jingyang Mao,Xiaoqian Peng,Peirong Ma,Xichen Yang,Weiguang Qu,Yanhui Gu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Deep neural networks, Deep neural, processing noisy labels, experience significant performance, significant performance degradation

备注：

点击查看摘要

Abstract:Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model's capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.

110. 【2605.20725】Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label

链接：https://arxiv.org/abs/2605.20725

作者：Jingyang Mao,Ningkang Peng,Yanhui Gu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：combines external annotations, single reliability weight, Learning with noisy, multimedia classification, classification often combines

备注：

点击查看摘要

Abstract:Learning with noisy labels in multimedia classification often combines external annotations and model predictions into a single reliability weight, even though the two sources can fail for different reasons. We instead estimate disentangled reliabilities: bilevel meta-learning produces two batch-normalized scalars per sample, alpha for the given label and beta for the pseudo-label, without constraining them to sum to one. Holistic Reliability Propagation (HRP) then routes them to different objectives, using reliability-aware Mixup with global gating on the input branch and beta-gated pseudo-label positives on the contrastive branch. On synthetic and real-world benchmarks, HRP improves average accuracy over strong baselines and remains competitive at the highest noise rates.

111. 【2605.20717】E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference

链接：https://arxiv.org/abs/2605.20717

作者：Ankit Kumar Tenwar,Mukul Lokhande,Santosh Kumar Vishvakarma

类目：Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词：work presents E-ReCON, resource-efficient digital, work presents, proposed bitcell occupies, conventional convolutional neural

备注：

点击查看摘要

Abstract:This work presents E-ReCON, a 16 Kb energy and resource-efficient digital compute-in-memory (DCIM) macro based on a compact 3T1R ReRAM bitcell for edge-AI inference. The proposed bitcell occupies only 0.85 um^2 and supports reliable AND-based in-memory multiplication for both conventional convolutional neural network (CNN) and spiking neural network (SNN) workloads. To reduce accumulation overhead, a novel interleaved 10T/28T adder tree is introduced, reducing transistor count and power consumption by 37% and 28%, respectively, compared to a conventional 28T RCA-based design. Implemented in 65 nm CMOS at 1.2 V, the proposed macro achieves a minimum latency of 0.48 ns, throughput of 2.31-3.1 TOPS, and energy efficiency of up to 419 TOPS/W. When evaluated on LeNet-5, AlexNet, and CNN-8 models, the macro achieves 97.81%, 93.23%, and 96.51% accuracy on MNIST/A-Z, CIFAR10, and SVHN datasets, respectively. In addition, 40% pruning preserves nearly 99.8% of the original accuracy while reducing MAC operations and computation cycles. For SNN-oriented workloads, the proposed AND-type bitcell efficiently supports spike-weight multiplication with low switching activity, where the 2A2W configuration achieves accuracy close to the FP32 baseline across VGG-8, VGG-16, and ResNet-18 networks on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. Compared to prior ADC-based ReRAM-CIM designs, the proposed architecture improves latency and energy efficiency by nearly 30-40% while maintaining robust operation under full PVT and ReRAM variability. Overall, E-ReCON provides a scalable, low-latency, and energy-efficient nvCIM platform for next-generation edge-AI, IoT, biomedical sensing, and neuromorphic applications.

112. 【2605.20713】SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

链接：https://arxiv.org/abs/2605.20713

作者：Miaobo Hu,Shuhao Hu,Bokun Wang,Rui Chen,Xin Wang,Xiaobo Guo,Daren Zha,Jun Xiao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：attach multiple images, weakly related, social media, media is difficult, post may attach

备注：

点击查看摘要

Abstract:Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper--Pearson upper bounds. When activated, a submodular relevance--diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text--image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as:
arXiv:2605.20713 [cs.CV]

(or
arXiv:2605.20713v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.20713

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>

113. 【2605.20708】Rethinking Cross-Layer Information Routing in Diffusion Transformers

链接：https://arxiv.org/abs/2605.20708

作者：Chao Xu,Maohua Li,Qirui Li,Yixuan Xu,Yanke Zhou,Yunhe Li,Cuifeng Shen,Hanlin Tang,Kan Liu,Tao Lan,Lin Qu,Shao-Qun Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：modern visual generation, visual generation, latent autoencoders, extensively revisited, facto backbone

备注：

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

114. 【2605.20682】IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

链接：https://arxiv.org/abs/2605.20682

作者：Rongbin Tan,Fangfang Lin,Zhenlong Yuan,Min Qiu,Kejin Cui,Mengmeng Wang,Yi Wang,Zijian Song,Zhiyuan Wang,Jiyuan Wang,Yue Wang,Shuhan Song§,Huawei Cao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal large language, shown remarkable capability, diverse industrial scenarios, Multimodal large, large language models

备注：

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose \textbf{IndusAgent}, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct \textbf{Indus-CoT}, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

115. 【2605.20680】DarkShake-DVS: Event-based Human Action Recognition under Low-light andShaking Camera Conditions

链接：https://arxiv.org/abs/2605.20680

作者：Jiaqi Chen,Qinfu Xu,Liyuan Pan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：diverse real-world applications, Human Action Recognition, fundamental computer vision, computer vision task, Human Action

备注： 8pages,7 figures

点击查看摘要

Abstract:Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 realworld clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

116. 【2605.20676】VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

链接：https://arxiv.org/abs/2605.20676

作者：Mozhgan Nasr Azadani,Yimu Wang,Yongpeng Zhu,Lihong Chen,Milan Ganai,Sean Sedwards,Marco Pavone,Krzysztof Czarnecki

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：current multimodal large, multimodal large language, Establishing a clear, large language model, current multimodal

备注：

点击查看摘要

Abstract:Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.

117. 【2605.20669】GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection

链接：https://arxiv.org/abs/2605.20669

作者：Jiahao Kong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：requires accurate real-time, accurate real-time detection, strict speed requirements, inspection requires accurate, Sparse Structure Selection

备注： 41 pages, 8 figures, submitted to Scientific Reports

点击查看摘要

Abstract:X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO's comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.

118. 【2605.20667】LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

链接：https://arxiv.org/abs/2605.20667

作者：Liming Hou,Yueping Peng,Hexiang Hao,Ji Wang,Xuekai Zhang,Wei Tang,Zecong Ye,Xin Ying,Yubo He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Detecting small unmanned, small unmanned aerial, unmanned aerial vehicles, remote-sensing pairs remains, pairs remains challenging

备注： 17 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7+/-0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.

119. 【2605.20659】RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

链接：https://arxiv.org/abs/2605.20659

作者：Yuxi Liu,Zekun Zhang,Yixiang Cai,Renjia Deng,Yutong He,Kun Yuan

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Diffusion Transformers, Rotary Position Embeddings, attention complexity poses, revolutionized high-fidelity video, long-sequence synthesis

备注：

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbf{RoPeSLR}, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90\% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3\% average VBench degradation).

120. 【2605.20651】Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

链接：https://arxiv.org/abs/2605.20651

作者：Tuopusen Huang,Ding Ma,Xiangqian Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Coherence Tomography Angiography, Optical Coherence Tomography, Existing deep learning, Tomography Angiography, deep learning frameworks

备注：

点击查看摘要

Abstract:Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture, which serves as the foundation for most current designs. However, most of these methods focus only on holistic representation, struggling to address the problem of low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core innovative modules: To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.

121. 【2605.20645】Seeing Through Fog: Towards Fog-Invariant Action Recognition

链接：https://arxiv.org/abs/2605.20645

作者：Enqi Liu,Liyuan Pan,Zhi Gao,Lingzhi Li,Qing Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：typically assume favorable, assume favorable weather, high-quality video inputs, existing action recognition, approaches typically assume

备注：

点击查看摘要

Abstract:Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.

122. 【2605.20640】Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

链接：https://arxiv.org/abs/2605.20640

作者：Yunlong Wang,Jinjin Shi,Wenbin Gao,Xuran Xu,Runyu Shi,Ying Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：human portrait generation, aesthetics inherently inhibit, Multimodal Diffusion Transformers, face a severe, severe trilemma

备注：

点击查看摘要

Abstract:Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

123. 【2605.20626】Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

链接：https://arxiv.org/abs/2605.20626

作者：Aashish Dhawan,Christopher Driggers-Ellis,Dzmitry Kasinets,Daisy Zhe Wang,Christan Grant

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Florida Gators submission, University of Florida, Florida Gators, cultural image captioning, present the University

备注：

点击查看摘要

124. 【2605.20624】Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

链接：https://arxiv.org/abs/2605.20624

作者：Taesung Kwon,Jonghyun Park,Hyungjin Chung,Jong Chul Ye

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：multiple VAE passes, provide powerful priors, zero-shot video inverse, video inverse problems, Video Inverse problem

备注： Project page is available here: [this https URL](https://avis-project.github.io/)

点击查看摘要

Abstract:Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

125. 【2605.20610】Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

链接：https://arxiv.org/abs/2605.20610

作者：Gene Tangtartharakul,Katherine R. Storrs

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：interpreted by analysing, analysing which categories, categories are routed, tuning, expert

备注： 21 Pages, 6 Main Figures, 1 Table

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

126. 【2605.20607】Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

链接：https://arxiv.org/abs/2605.20607

作者：Romeo Valentin,Olivia Beyer Bruvik,Marc R. Schlichting,Mykel J. Kochenderfer

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：guidance requires data-driven, requires data-driven aviation, model situation representation, situation representation, open problem

备注： 10 pages, 4 figures

点击查看摘要

Abstract:EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

127. 【2605.20606】Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

链接：https://arxiv.org/abs/2605.20606

作者：Muquan Li,Yingyi Ma,Yihong Huang,Hang Gou,Ke Qin,Ming Li,Yuan-Fang Li,Tao He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：large training set, small synthetic set, leave robustness uncontrolled, Robust Dataset Distillation, Dataset distillation

备注： Accepted to ICML 2026

点击查看摘要

Abstract:Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by $2.8$% on average.

128. 【2605.20600】Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

链接：https://arxiv.org/abs/2605.20600

作者：Guotao Liang,Baoquan Zhang,Zhiyuan Wen,Yunming Ye

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：previously generated visual, achieved remarkable performance, caching previously generated, generated visual tokens, high memory usage

备注： Under review

点击查看摘要

Abstract:Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, \emph{i.e.}, a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.

129. 【2605.20588】Direct Translation between Sign Languages

链接：https://arxiv.org/abs/2605.20588

作者：Zetian Wu,Bowen Xie,Wuyang Meng,Milan Gautam,Stefan Lee,Liang Huang

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：witnessed significant progress, remains largely unexplored, languages remains largely, sign language, sign

备注：

点击查看摘要

130. 【2605.20584】QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

链接：https://arxiv.org/abs/2605.20584

作者：Dishanika Denipitiyage,Aruna Seneviratne,Suranga Seneviratne

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：disclose standardized content, marketplaces require developers, require developers, developers to disclose, disclose standardized

备注：

点击查看摘要

Abstract:Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlights the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.

131. 【2605.20578】A strongly annotated passive acoustic dataset for tropical bird monitoring

链接：https://arxiv.org/abs/2605.20578

作者：Daniela Ruiz,Juan Sebastián Ulloa,Zhongqi Miao,Nicolás Betancourt,Maria Paula Toro-Gómez,Andrés Hernández,Bruno Demuro,Eliana Barona-Cortés,Angela Mendoza-Henao,Andrés Sierra-Ricaurte,Sebastián Pérez-Peña,Rahul Dodhia,Pablo Arbeláez,Juan M. Lavista Ferres

类目：ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

关键词：monitoring enables continuous, non-invasive biodiversity assessment, Passive acoustic monitoring, acoustic monitoring enables, enables continuous

备注：

点击查看摘要

Abstract:Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet's usability and the challenges it presents.

132. 【2605.20576】$Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

链接：https://arxiv.org/abs/2605.20576

作者：Chia-Hsiang Kao,Cong Phuoc Huynh,Chien-Yi Wang,Noranart Vesdapunt,Stefan Stojanov,Bharath Hariharan,Oleksandr Obiednikov,Ning Zhou

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Inferring rigid-body physical, rigid-body physical states, Inferring rigid-body, states and properties, properties from monocular

备注： Accepted to CVPR 2026. Project page: [this https URL](https://iandrover.github.io/2026_dynamics)

点击查看摘要

Abstract:Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $\Delta$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $\Delta$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $\Delta$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.

133. 【2605.20569】End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

链接：https://arxiv.org/abs/2605.20569

作者：Xu Han,Mohammad Aminul Islam,Lei Wang,Zekun Long,Guanmanyi Fu,Wangshu Cai,Kuldip K. Paliwal,Jun Zhou

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：imagery encodes rich, encodes rich material, rich material properties, Hyperspectral imagery encodes, improve tracking robustness

备注：

点击查看摘要

Abstract:Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at this https URL.

134. 【2605.20551】Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

链接：https://arxiv.org/abs/2605.20551

作者：Zichao Zeng,June Moh Goo,Junwei Zheng,Weijia Fan,Jiaming Zhang,Rainer Stiefelhagen,Jan Boehm

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词：Visual Place Recognition, Visual Place, Place Recognition, query image, reference images

备注：

点击查看摘要

Abstract:Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

135. 【2605.20549】MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

链接：https://arxiv.org/abs/2605.20549

作者：Santiago Galella,Pamela Osuna-Vargas,Maren Wehrheim,Martina G. Vilas,Gemma Roig,Matthias Kaschube

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieve strong performance, models achieve strong, scene properties drive, drive their predictions, vision models achieve

备注： 33 pages, 20 figures

点击查看摘要

Abstract:Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

136. 【2605.20544】he Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

链接：https://arxiv.org/abs/2605.20544

作者：Doguhan Yeke,Elif Su Temirel,Ananth Shreekumar,Brandon Lee,Dongyan Xu,Z Berkay Celik

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：translating natural language, natural language instructions, translating natural, action plans, Vision-language models

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at this https URL.

137. 【2605.20543】Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

链接：https://arxiv.org/abs/2605.20543

作者：Huan Huang,Michele Esposito,Chen Zhao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：remains challenging due, complex vascular patterns, Accurate vessel segmentation, medical image analysis, Accurate vessel

备注： Pattern Recognition submission. 35 pages, 6 figures

点击查看摘要

Abstract:Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local predictions interaction. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at this https URL.

138. 【2605.20538】Continual Segmentation under Joint Nonstationarity

链接：https://arxiv.org/abs/2605.20538

作者：Prashant Pandey,Himanshu Kumar,Devineni Sri Venkatraya Chowdary,Brejesh Lall

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：availability change simultaneously, data streams induce, streams induce joint, supervision availability change, semantic classes

备注：

点击查看摘要

Abstract:Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.

139. 【2605.20536】HADS-Net:A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification

链接：https://arxiv.org/abs/2605.20536

作者：Chinedu Emmanuel Mbonu,Blessing Nwamaka Iduh,Joseph Ikechukwu Odo,Doris Chinedu Asogwa

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：critical clinical task, clinical task complicated, inter-class visual ambiguity, Accurate classification, critical clinical

备注： 7 pages, 4 figures

点击查看摘要

Abstract:Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.

140. 【2605.20525】NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

链接：https://arxiv.org/abs/2605.20525

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词：brain magnetic resonance, magnetic resonance imaging, visual question answering, visual question, question answering

备注： 30 pages, dataset and benchmark release

点击查看摘要

141. 【2605.20510】ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

链接：https://arxiv.org/abs/2605.20510

作者：Longchao Da,Mithun Shivakoti,Xiangrui Liu,T Pranav Kutralingam,Yezhou Yang,Hua Wei

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词：heat island effect, intensifying urban heat, urban heat island, Urban heat exposure, increasingly critical challenge

备注： 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track

点击查看摘要

Abstract:Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at this https URL.

142. 【2605.20502】ppett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

链接：https://arxiv.org/abs/2605.20502

作者：Neelkamal Bhuyan

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)

关键词：semantic divergence, texture differences, global domain, representation-space diffusion models, full spectrum

备注： 14 pages

点击查看摘要

Abstract:We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L -- an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics: $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions) -- quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.

143. 【2605.20495】A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

链接：https://arxiv.org/abs/2605.20495

作者：Abhiram Kandiyana,Ankur Mali,Lawrence O. Hall,Peter R. Mouton,Dmitry Goldgof

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：produce high-quality ground, high-quality ground truth, Deep-learning pipelines, microscopy image classification, require expensive

备注： Accepted to CVPR workshops, 2026

点击查看摘要

Abstract:Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.

144. 【2605.20479】Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

链接：https://arxiv.org/abs/2605.20479

作者：Jianmin Liao,Lixin Shen,Yuesheng Xu

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：critical practical bottleneck, TGV variational solvers, modern diffusion-based models, ranging from classical, TGV variational

备注：

点击查看摘要

Abstract:Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser--noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only $2$ target oracle labels, it reaches $30.23$\,dB, within $0.90$\,dB of the oracle, and outperforms the $64$-label per-configuration predictor trained from scratch, using $1/32$ as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

145. 【2605.20476】Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

链接：https://arxiv.org/abs/2605.20476

作者：Matthew Bendel,Stephen W. Bailey,Mithilesh Vaidya,Sumukh Badam,Xingzhe He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Long-horizon video generation, Long-horizon video, video generation suffers, Long-horizon, Anchored Tree Sampling

备注： 30 pages, 23 figures

点击查看摘要

Abstract:Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce \textbf{Anchored Tree Sampling (ATS)}: a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emph{static-camera} regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan $2.1$ $+$ VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-$2.3$ across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.

146. 【2605.20470】EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

链接：https://arxiv.org/abs/2605.20470

作者：Alzahra Altalib,Chunhui Li,Haytham Al Ewaidat,Khaled Alawneh,Ahmad Qendel,Alessandro Perelli

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

关键词：limiting Hounsfield Unit, Hounsfield Unit, limiting Hounsfield, degraded by scatter, reconstruction artifacts

备注： 10 pages, 4 figures

点击查看摘要

Abstract:Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.

147. 【2605.20469】HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

链接：https://arxiv.org/abs/2605.20469

作者：Haoyu Wang,Zitong Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：medical image interpretation, pose direct patient, generating clinically plausible, Vision-language models, factually incorrect findings

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9--82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.

148. 【2605.20461】Understanding Model Behavior in Monocular Polyp Sizing

链接：https://arxiv.org/abs/2605.20461

作者：Xinqi Xiong,Andrea Dunn Beltran,Junmyeong Choi,Sarah K. McGill,Marc Niethammer,Roni Sengupta

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：guides surveillance decisions, requiring closer follow-up, stratification guides surveillance, typically requiring closer, Accurate polyp size

备注：

点击查看摘要

Abstract:Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (=5 mm vs. 5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at this https URL.

149. 【2605.20460】HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning

链接：https://arxiv.org/abs/2605.20460

作者：Astitva Srivastava,Hsiao-Yu Chen,Ryan Goldade,Philipp Herholz,Zhongshi Jiang,Gene Wei-Chin Lin,Lingchen Yang,Nikolaos Sarafianos,Tuur Stuyck,Doug Roble,Avinash Sharma,Egor Larionov

类目：Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent advances, high-quality results closer, brought high-quality results, brought high-quality, Recent

备注：

点击查看摘要

Abstract:Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In contrast, linear blend skinning is efficient, but cannot capture the complex dynamics of loose-fitting garments, often leading to unrealistic motion and visual artifacts. Neural methods offer a promising alternative, yet they still struggle to animate loose clothing plausibly under strict runtime constraints. We present a fast and physically plausible approach for dynamic garment simulation. Our method trains a reduced-space neural dynamics simulator composed of independent coarse- and fine-level components. At the coarse level, the garment is driven by a set of virtual bones integrated with a lightweight neural network. Fine-scale wrinkle details are then recovered using a trained convolutional neural map. By decoupling identity-specific computation from real-time neural integration, our architecture maintains high performance while supporting diverse body shapes and motions. We further introduce an effective physics-supervision scheme that enables accurate results without relying on an external simulator. Experiments show that our method produces physically plausible garment dynamics, generalizes across a range of motions and body shapes, and supports a fixed set of garments. Our simulator runs at 300+ FPS on a commodity GPU, making it suitable for real-time applications.

150. 【2605.20459】Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

链接：https://arxiv.org/abs/2605.20459

作者：Sarmad Khan,Arslan Shaukat,Umer Asgher,Basim Azam

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：recent years, medical image segmentation, notable increase, level of attention, algorithms based

备注： 7 pages, 6 figures, 4 tables

点击查看摘要

Abstract:In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

151. 【2605.20458】ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach

链接：https://arxiv.org/abs/2605.20458

作者：Erick O. Rodrigues,Aura Conci,Panos Liatsis

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：age-related macular degeneration, Vascular structures, including age-related macular, macular degeneration, diabetic retinopathy

备注：

点击查看摘要

Abstract:Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.

152. 【2605.20448】Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

链接：https://arxiv.org/abs/2605.20448

作者：Animesh Maheshwari,Divyansh Sahu,Nishit Verma

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：objects inhabit, language models reliably, reliably name objects, objects, models reliably

备注：

点击查看摘要

Abstract:Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

153. 【2605.20445】A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT X-ray Imagery

链接：https://arxiv.org/abs/2605.20445

作者：Sarmad Khan,Arslan Shaukat,Umer Asgher,Basim Azam

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：numerous lives daily, lives daily, significant challenge, challenge that led, loss of numerous

备注： 6 pages, 2 figures, 5 tables

点击查看摘要

Abstract:COVID-19 was a significant challenge that led to the loss of numerous lives daily. Not only a certain country was involved in this outbreak, but even the world has suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for the COVID-19 or any other pandemic disease screening process. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural networks (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung pictures. We used two different sets of X-ray images of the lungs in addition to two different sets of CT scans and the classification is done using a variety of networks that have been pre-trained such as VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), Mobile net (V2), Xception Inception (V3, Resnet V2), Efficient net (B0) and Nasnet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.

154. 【2605.20436】Lighting-aware Unified Model for Instance Segmentation

链接：https://arxiv.org/abs/2605.20436

作者：Qisai Liu,Alloy Das,Zhanhong Jiang,Joshua R. Waite,Aditya Balu,Adarsh Krishnamurthy,Soumik Sarkar

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Foundation models, impressive zero-shot generalization, Segment Anything Model, demonstrate impressive zero-shot, impressive zero-shot

备注：

点击查看摘要

Abstract:Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing \textit{Lighting Convolutional-Attention (\lca{})}, an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. \lca{} employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize \lca{} through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.

155. 【2605.20390】STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

链接：https://arxiv.org/abs/2605.20390

作者：Yingwei Li,Xin Huang,Yang Liu,Yang Fu,Alex Zihao Zhu,Chen Song,Junwen Yao,Anant Subramanian,Hao Xiang,Weijing Shi,Yuliang Zou,Tom Hoddes,Zhaoqi Leng,Govind Thattai,Dragomir Anguelov,Mingxing Tan

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：demonstrated remarkable success, demonstrated remarkable, remarkable success, Sparse Window Transformer, Model

备注：

点击查看摘要

Abstract:Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

156. 【2605.20388】How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

链接：https://arxiv.org/abs/2605.20388

作者：Sejoon Jun,Hai Nguyen-Truong,Luigi Seminara,Lorenzo Torresani

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：person first-person view, minimize prediction error, view will evolve, completes a task, shot will score

备注： Project page: [this https URL](https://farsightlab.github.io/TrajPilot)

点击查看摘要

Abstract:Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

157. 【2605.20385】ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

链接：https://arxiv.org/abs/2605.20385

作者：Yuan Zhao,Youwei Pang,Jiaming Zuo,Wei Ji,Kailai Zhou,Bin Fan,Yunkang Cao,Lihe Zhang,Xiaofeng Liu,Huchuan Lu,Weisi Lin,Dacheng Tao,Xiaoqi Zhao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Recent progress, shifted visual perception, concept, segmentation, Recent

备注：

点击查看摘要

Abstract:Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.

158. 【2605.20373】SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

链接：https://arxiv.org/abs/2605.20373

作者：Tianshu Wu,Xiangqi Kong,Yue Chen,Qize Yu,Hang Ye,Jia Li,Yizhou Wang,Hao Dong

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Building humanoid robots, real world remains, humanoid robots capable, generalizable whole-body loco-manipulation, Building humanoid

备注： Project Page: [this https URL](https://tianshuwu.github.io/sugar-humanoid/)

点击查看摘要

Abstract:Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: this https URL

159. 【2605.20372】Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

链接：https://arxiv.org/abs/2605.20372

作者：Irem Ulku,Ö. Özgür Tanrıöver,Erdem Akagündüz

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：semantic segmentation benefits, combining complementary information, Multimodal semantic segmentation, segmentation benefits remote, benefits remote sensing

备注： 14 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at this https URL

160. 【2605.20362】HAPS: Rethinking Image Similarity for Virtual Staining

链接：https://arxiv.org/abs/2605.20362

作者：Fedor Gubanov,Svetlana Illarionova,Vlad Kozlovskiy,Mikhail Romanov,Yersultan Akhmetov,Aida Akaeva,Vyacheslav Grinevich,Rifat Hamoudi,Maxim Sharaev

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：routinely acquired slides, synthesizing target stains, Virtual staining, virtual staining models, digital pathology

备注： 17 pages, 3 figures

点击查看摘要

Abstract:Virtual staining of histopathology images (e.g., HE-IHC) is an emerging tool in digital pathology, enabling faster and cheaper workflows by synthesizing target stains from routinely acquired slides. Yet, the quality of virtual staining models is still predominantly assessed with generic metrics such as SSIM, PSNR, and LPIPS. Originally developed for natural images, these metrics are inherently misaligned with the domain-specific characteristics of histological data, failing to capture tissue morphology preservation and biomarker expression patterns. Consequently, a robust, domain-specific standard for quantifying similarity across diverse histological modalities remains a critical gap in the field. In this work, we formalize histology image similarity as a standalone problem and systematically evaluate a broad set of full-reference metrics against a dataset of HE-IHC patch pairs annotated with expert similarity scores. We further analyze metrics sensitivity to controlled geometric distortions (shifts, rotations and non-rigid deformations) that mimic realistic registration errors between serial sections. Guided by these observations, we propose the Histology-Aware Perceptual Similarity (HAPS) metric. HAPS computes distances in the feature space of a frozen encoder pretrained on histopathology data, adding a linear head to aggregate feature-level differences into a final score that aligns with expert assessments. Finally, we demonstrate the practical value of HAPS for quality control of training data. By quantifying the similarity of training pairs in the MIST dataset and filtering low-scoring samples, we create a cleaner training set. Virtual staining models trained on this refined data outperform those trained on the original, unfiltered dataset.

161. 【2605.20342】ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

链接：https://arxiv.org/abs/2605.20342

作者：Zuhao Yang,Kaichen Zhang,Sudong Wang,Keming Wu,Zhongyu Yang,Bo Li,Xiaojuan Qi,Shijian Lu,Xingxuan Li,Lidong Bing

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：natively invoke video-processing, Training large multimodal, invoke video-processing tools, large multimodal models, Parallel Video Tool

备注： Project Page: [this https URL](https://evolvinglmms-lab.github.io/ParaVT/)

点击查看摘要

Abstract:Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

162. 【2605.20337】Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

链接：https://arxiv.org/abs/2605.20337

作者：Julien Colin,Lore Goetschalckx,Nuria Oliver,Thomas Serre

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：leading vision models, models, leading vision, vision models, foundation models

备注：

点击查看摘要

Abstract:How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than $15{,}000$ behavioral responses, analyzing the $13{,}400$ responses from the $377$ participants who passed our pre-specified quality checks. Foundation models are consistently *less* interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

163. 【2605.20316】FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

链接：https://arxiv.org/abs/2605.20316

作者：Eric Tillmann Bill,Enis Simsar,Alessio Tonioni,Thomas Hofmann

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：encode rich visual, rich visual priors, one-way text-conditioned generation, models encode rich, rich visual

备注： project page: [this https URL](https://ericbill21.github.io/fullflow/)

点击查看摘要

Abstract:Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$ on two RTX A5000 GPUs in under 24 hours, training only ${\sim}5\%$ of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

164. 【2605.20309】ny-Engram: Trigger-Indexed Concept Tables for Generative Vision

链接：https://arxiv.org/abs/2605.20309

作者：Runyuan Cai,Yiming Wang,Yu Lin,Xiaodong Zeng

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Current personalization methods, vision models typically, models typically encode, generative vision models, weight updates

备注：

点击查看摘要

Abstract:Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.

165. 【2605.20308】SDM: A Powerful Tool for Evaluating Model Robustness

链接：https://arxiv.org/abs/2605.20308

作者：Xinlei Liu,Tao Hu,Jichao Xie,Peng Yi,Hailong Ma,Baolin Li

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：evaluating model robustness, model robustness, evaluating model, gradient-based attack method, Sequential Difference Maximization

备注： 16 pages

点击查看摘要

Abstract:Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at this https URL.

166. 【2605.20306】WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

链接：https://arxiv.org/abs/2605.20306

作者：Bingnan Liu,Chenhang Cui,Rui Huang,Jiani Luo,Zhirong Shen,Tinghao Wang,Xiande Huang,Lingbei Meng,Fei Shen,An Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：annotated UAV corpus, professionally annotated UAV, couples direct visual, single professionally annotated, UAV corpus

备注： Preprint. Under review. 4 figures, 6 tables

点击查看摘要

Abstract:We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at this https URL to support reproducible follow-up research.

167. 【2605.20302】Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

链接：https://arxiv.org/abs/2605.20302

作者：Panagiotis Koromilas,Theodoros Giannakopoulos,Mihalis A. Nicolaou,Yannis Panagakis

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Neural Collapse, dominant paradigms reaches, theoretical optimum, Supervised classification, NTCE and NONL

备注： 43rd International Conference on Machine Learning (ICML 2026); Code: [this https URL](https://github.com/pakoromilas/nc_by_design)

点击查看摘要

Abstract:Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method, prototype contrast on the unit hypersphere, and that closing the gap requires fixing each at its specific point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and lower mCE on ImageNet-C, recasting supervised learning as prototype learning on the hypersphere, with NC reached by design on both paths.

168. 【2605.20301】Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

链接：https://arxiv.org/abs/2605.20301

作者：Wenxuan Li,Qin Zou,Shoubing Chen,Chi Chen,Yingyi Yang,Shoubing Chen,Qingxiang Meng

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：autonomous driving, detection is essential, essential for accurate, accurate perception, object detection

备注：

点击查看摘要

Abstract:In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2605.20301 [cs.CV]

(or
arXiv:2605.20301v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.20301

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Wenxuan Li [view email] [v1]
Tue, 19 May 2026 12:36:09 UTC (1,231 KB)

169. 【2605.20297】MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

链接：https://arxiv.org/abs/2605.20297

作者：Ziyuan Gao

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：learning requires discovering, effective continual learning, tasks share sufficient, share sufficient structure, image segmentation faces

备注： Accepted by ICML 2026

点击查看摘要

Abstract:Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6$\times$ fewer parameters. Code is available at this https URL.

170. 【2605.20290】Physics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction

链接：https://arxiv.org/abs/2605.20290

作者：Xin Zhang,Yabo Chen,Yijie Fang,Wanying Qu,Haibin Huang,Chi Zhang,Feng Xu,Xuelong Li

类目：Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent generative video, models achieve impressive, Recent generative, generative video models, video models achieve

备注：

点击查看摘要

Abstract:Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scenelevel multi-object interactions and introduces richer, complex control types for advanced mechanicsbased manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews paired while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at this https URL.

171. 【2605.20287】FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

链接：https://arxiv.org/abs/2605.20287

作者：Haoyi Zhang,Kairong Guo,Bojie Zhang,Yibo Lin,Runsheng Wang

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Standard cells form, critically influence chip-level, influence chip-level performance, fast predictors ignore, ignore layout geometry

备注：

点击查看摘要

Abstract:Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

172. 【2605.20284】JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

链接：https://arxiv.org/abs/2605.20284

作者：Hyunju Kang,Woohyun Lee,Jaewon Kim,Hogun Park

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large Multimodal Models, diverse human instructions, Large Multimodal, advanced by Large, enabling diverse human

备注： Published at ICLR 2026

点击查看摘要

Abstract:Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

173. 【2605.20282】Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

链接：https://arxiv.org/abs/2605.20282

作者：Zhenyu Yu,Yangchen Zeng,Chunlei Meng,Guangzhen Yao,Shuigeng Zhou

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Vertical Federated Learning, attracted growing interest, Centered Kernel Alignment, Linear Probe Recovery, Feature Separability Scoring

备注：

点击查看摘要

Abstract:Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.

174. 【2605.20278】ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

链接：https://arxiv.org/abs/2605.20278

作者：Tianle Li,Xuyang Shen,Yan Ma,Rongxin Guo,Shaoxiang Chen,Jiacheng Chen,Haochen Wang,Hongyang Tang,Yucong Zhou,Yu Cheng

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：reward granularity problem, important errors occur, Long-form image captioning, individual visual claims, Long-form image

备注：

点击查看摘要

Abstract:Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness and coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination--missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

175. 【2605.20277】Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

链接：https://arxiv.org/abs/2605.20277

作者：Tianwei Lin,Zhongwei Qiu,Jie Cao,Jiang Liu,Wenjie Yan,Bo Zhang,Yu Zhong,Wenqiao Zhang,Yingda Xia,Ling Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Computed Tomography, general-purpose multimodal assistants, analysis remains constrained, Current Reinforcement Learning, multimodal assistants

备注：

点击查看摘要

Abstract:Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce ``\textit{Evaluation Hallucinations}'', where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the \textbf{Clinical Abnormality Benchmarking Substrate (CABS)}, a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a ``\textit{Mechanistic Divergence}'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbf{Trajectory-Integral Feedback GRPO (TIF-GRPO)}, a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at \href{this https URL}{GitHub}.

176. 【2605.20275】You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

链接：https://arxiv.org/abs/2605.20275

作者：Sana Alamgeer,Ronish Kumar,Awatif Yasmin,Muhammad Irshad,Anne H. H. Ngu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Existing deep learning, quadratic computational overhead, deep learning approaches, impose quadratic computational, Existing deep

备注：

点击查看摘要

Abstract:Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.

177. 【2605.20267】Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

链接：https://arxiv.org/abs/2605.20267

作者：Suya Li,Kaushik Dutta,Debojyoti Pal,Jingqin Luo,Kooresh I. Shoghi

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Synthetic PET images, conventional physics-based simulation, physics-based simulation approaches, Synthetic PET, imaging workflow development

备注： 18 pages, 7 figures

点击查看摘要

Abstract:Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.

178. 【2605.20254】Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

链接：https://arxiv.org/abs/2605.20254

作者：Amritansh Maurya,Navjot Singh,Mohammed Javed,Omar Moured

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Large Language Models, Large Language, requires precise cell, shown promising results, precise cell retrieval

备注： Accepted for Presentation in ICDAR 2026, Vienna, Austria

点击查看摘要

179. 【2605.20247】CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

链接：https://arxiv.org/abs/2605.20247

作者：Yang Liu,Toan Nguyen,Flora D. Salim

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：Catastrophic forgetting remains, continual learning, Catastrophic forgetting, remains a major, major obstacle

备注：

点击查看摘要

180. 【2605.20237】AnimeAdapter: Fine-grained and Consistent Zero-shot Anime Character Generation

链接：https://arxiv.org/abs/2605.20237

作者：Yixuan Han

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：diverse editing conditions, Stable Diffusion, enables controllable, controllable and consistent, generation under diverse

备注：

点击查看摘要

Abstract:We present a lightweight appearance adapter for Stable Diffusion that enables controllable and consistent anime character generation under diverse editing conditions. Instead of relying on large-scale vision-language models or per-subject fine-tuning, our method injects fine-grained visual features from a single reference image into the diffusion process. Based on CLIP emergent local spatialization, we develop semantic-selective local attention. To further disentangle character appearance from spatial layout, we incorporate pose-aware conditioning during adapter training. The resulting pretrained adapter remains compact, modular, and fully compatible with Stable Diffusion community workflows, while requiring no additional fine-tuning at deployment time. Furthermore, we present a high-quality anime character dataset based on curated and restructured Danbooru prompts, and evaluate our method across several practical character editing scenarios. Our code, model weights, and dataset will be publicly released upon acceptance.

181. 【2605.20233】AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

链接：https://arxiv.org/abs/2605.20233

作者：Hanchen David Wang,Yilin Liu,Madison J. Lee,Surya Chand Rayala,Gautam Biswas,Daniel T. Levin,Meiyi Ma

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Assessing learner competency, Assessing learner, clinical simulation requires, simulation requires expert, requires expert observation

备注： Accepted at CVPR Workshop

点击查看摘要

Abstract:Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.

182. 【2605.20223】Why Latent Actions Fail, and How to Prevent It

链接：https://arxiv.org/abs/2605.20223

作者：Jung Min Lee,Taehyun Cho,Li Zhao,Jungwoo Lee

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：learn action-like representations, aim to learn, learn action-like, exogenous, Latent

备注：

点击查看摘要

Abstract:Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observation; and (2) learning in a representation space that focuses on endogenous components is a key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action-supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.

183. 【2605.20211】Leveraging Vision-Language Models to Detect Attention in Educational Videos

链接：https://arxiv.org/abs/2605.20211

作者：Gabriel Becquet(LIP6, CNRS, SU),Sébastien Lallé(CNRS, LIP6, SU),Vanda Luengo(LIP6, CNRS, SU),Ali Abou-Hassan(SU, CNRS, PHENIX, IUF)

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：cornerstone of remote, remote and blended, blended learning, effective information retention, fluctuating attention remains

备注：

点击查看摘要

Abstract:Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to a multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately found that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

184. 【2605.21251】Local-sensitive connectivity filter (ls-cf): A post-processing unsupervised improvement of the frangi, hessian and vesselness filters for multimodal vessel segmentation

链接：https://arxiv.org/abs/2605.21251

作者：Erick O Rodrigues,Lucas O Rodrigues,João HP Machado,Dalcimar Casanova,Marcelo Teixeira,Jeferson T Oliva,Giovani Bernardes,Panos Liatsis

类目：Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词：retinal vessel analysis, Frangi filter response, Frangi filter, naive connectivity filter, thresholded Frangi filter

备注：

点击查看摘要

Abstract:A retinal vessel analysis is a procedure that can be used as an assessment of risks to the eye. This work proposes an unsupervised multimodal approach that improves the response of the Frangi filter, enabling automatic vessel segmentation. We propose a filter that computes pixel-level vessel continuity while introducing a local tolerance heuristic to fill in vessel discontinuities produced by the Frangi response. This proposal, called the local-sensitive connectivity filter (LS-CF), is compared against a naive connectivity filter to the baseline thresholded Frangi filter response and to the naive connectivity filter response in combination with the morphological closing and to the current approaches in the literature. The proposal was able to achieve competitive results in a variety of multimodal datasets. It was robust enough to outperform all the state-of-the-art approaches in the literature for the OSIRIX angiographic dataset in terms of accuracy and 4 out of 5 works in the case of the IOSTAR dataset while also outperforming several works in the case of the DRIVE and STARE datasets and 6 out of 10 in the CHASE-DB dataset. For the CHASE-DB, it also outperformed all the state-of-the-art unsupervised methods.

185. 【2605.20496】Platonic Representations in the Human Brain: Unsupervised Recovery of Universal Geometry

链接：https://arxiv.org/abs/2605.20496

作者：Pablo Marcos-Manchón,Rishi Jha,Lluís Fuentemilla

类目：Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

关键词：Strong Platonic Representation, Platonic Representation Hypothesis, Strong Platonic, Representation Hypothesis suggests, Hypothesis suggests

备注： Code available at [this https URL](https://github.com/memory-formation/platonic-representations-fmri)

点击查看摘要

Abstract:The Strong Platonic Representation Hypothesis suggests that representational convergence in artificial neural networks can be harnessed constructively: embeddings can be translated across models through a universal latent space without paired data. We ask whether an analogous geometry can be recovered across human brains. Using fMRI data from the Natural Scenes Dataset, we propose a self-supervised encoder that learns subject-specific embeddings from brain data alone by exploiting repeated stimulus presentations. We show that these independently learned spaces can be translated across subjects using unsupervised orthogonal rotations, without paired cross-subject samples or intermediate model representations. Synchronizing pairwise rotations into a single shared latent space further improves cross-subject retrieval, indicating that subject-specific spaces are mutually compatible with a common coordinate system. These results provide evidence for a shared neural geometry in the human visual cortex: subject-specific fMRI representations are approximately isometric across individuals and can be translated through purely geometric transformations.

186. 【2605.20405】Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

链接：https://arxiv.org/abs/2605.20405

作者：Iason Skylitsis,Dimitrios Karkalousos,Ivana Išgum

类目：Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

关键词：Class imbalance, frequent classes typically, classes typically dominate, typically dominate training, sampling

备注：

点击查看摘要

Abstract:Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at this https URL.