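This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

792 papers were updated today, including:

Natural language processing: 121 papers
Information retrieval: 14 papers
Computer vision: 180 papers

Natural Language Processing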

1. 【2605.15198】ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Link: https://arxiv.org/abs/2605.15198

Authors: Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: promising direction, intermediate visual states, Visual, reasoning, Visual reasoning

Comments: Project page: [atlas-oneword.github.io](https://atlas-oneword.github.io); code: [github.com/ZiyuGuo99/ATLAS](https://github.com/ZiyuGuo99/ATLAS)

Abstract:Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

2. 【2605.15188】FutureSim: Replaying World Events to Evaluate Adaptive Agents

Link: https://arxiv.org/abs/2605.15188

Authors: Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: deployed in dynamic, increasingly deployed, environments that require, require adapting, January to March

Comments: 31 pages, 10 main

Abstract:AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
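
For readers unfamiliar with the metric mentioned above, here is a minimal Python sketch of a Brier skill score for binary questions. The reference forecast (a constant 0.5, standing in for "making no prediction") and the sample forecasts are assumptions for illustration; the paper may define its baseline and scoring differently.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference=0.5):
    """1 - BS / BS_ref: positive beats the reference, negative is worse.

    The constant `reference` forecast is an assumed stand-in for abstaining.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


# Hypothetical resolved questions and two hypothetical forecasters.
outcomes = [1, 0, 0, 1]
confident_wrong = [0.1, 0.9, 0.8, 0.2]
calibrated = [0.8, 0.2, 0.3, 0.7]

print(brier_skill_score(confident_wrong, outcomes))  # negative: worse than abstaining
print(brier_skill_score(calibrated, outcomes))       # positive: better than abstaining
```

A negative score is what "worse than making no prediction at all" looks like under this assumed baseline: confident wrong answers are penalized more heavily than hedging at 0.5.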

3. 【2605.15184】Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Link: https://arxiv.org/abs/2605.15184

Authors: Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, autonomously retrieve information, Large Language, enabled complex agentic, complex agentic workflows

Abstract:Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.

4. 【2605.15172】MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Link: https://arxiv.org/abs/2605.15172

Authors: Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma, Ahmed Salem

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: assistants in safety, privacy-critical applications, increasingly deployed, deployed as general-purpose, general-purpose assistants

Abstract:Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.

5. 【2605.15168】Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Link: https://arxiv.org/abs/2605.15168

Authors: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: Reconstructing precise clinical, Reconstructing precise, risk in complex, heterogeneous conditions, conditions like sepsis

Comments: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)

Abstract:Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

6. 【2605.15156】MeMo: Memory as a Model

Link: https://arxiv.org/abs/2605.15156

Authors: Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, Large language, range of tasks, subsequent updates, wide range

Comments: This paper introduces MeMo, a framework that augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise

Abstract:Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

7. 【2605.15155】Self-Distilled Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2605.15155

Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: post-training LLM agents, post-training LLM, Agentic Reinforcement Learning, Reinforcement learning, trajectory-level reward signal

Abstract: Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.

8. 【2605.15138】Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Link: https://arxiv.org/abs/2605.15138

Authors: Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Emerging Technologies (cs.ET)

Keywords: Standard unlearning evaluations, unlearning evaluations measure, deployed language model, evaluations measure behavioral, Standard unlearning

Abstract:Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
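
The bin-width argument in this abstract can be illustrated with a small Python sketch. It assumes a uniform round-to-nearest grid with a made-up bin width; NF4 itself uses non-uniform levels, so this is a simplified illustration of the idea and not the paper's analysis or the MANSU method.

```python
def quantize_index(w, bin_width):
    """Index of the nearest grid point on a uniform grid (an assumed stand-in for NF4)."""
    return round(w / bin_width)


BIN = 0.1            # hypothetical bin width
w = 0.30             # hypothetical weight before unlearning

diffuse = w - 0.001  # update 100x smaller than the bin width
sparse = w - 0.06    # update larger than half the bin width

# The diffuse update lands in the same bin, so quantization erases it.
print(quantize_index(w, BIN), quantize_index(diffuse, BIN))
# The larger, concentrated update crosses a bin boundary and survives.
print(quantize_index(w, BIN), quantize_index(sparse, BIN))
```

Under these assumptions, an update survives quantization only if it is large enough to move the weight into a different bin, which is the intuition behind enforcing a per-parameter magnitude floor.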

9. 【2605.15128】MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Link: https://arxiv.org/abs/2605.15128

Authors: Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, Ruixiang Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: existing evaluations rarely, evaluations rarely test, existing evaluations, evaluations rarely, rarely test

Comments: 46 pages, 15 figures

Abstract:Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

10. 【2605.15118】Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

Link: https://arxiv.org/abs/2605.15118

Authors: Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Technique matrix grounded, arXiv security studies, benchmarks collectively cover, LLM attack benchmarks, inference-time attacks extracted

Abstract: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023-2026). The matrix enables benchmark-external validation: auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.

11. 【2605.15110】Proposal and study of statistical features for string similarity computation and classification

Link: https://arxiv.org/abs/2605.15110

Authors: E.O. Rodrigues, D. Casanova, M. Teixeira, V. Pegorini, F. Favarim, E. Clua, A. Conci, Panos Liatsis

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)

Keywords: co-occurrence matrix, run-length matrix, visual computing, strings in general, similarity computation

Abstract:Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
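
As a rough illustration of how co-occurrence and run-length statistics might carry over from images to strings, here is a short Python sketch. The abstract does not give the exact definitions (offsets, normalization, or the features derived from each matrix), so the function names and the cosine comparison below are assumptions and not the authors' method.

```python
from collections import Counter
from itertools import groupby


def cooccurrence(s, offset=1):
    """Count ordered character pairs (s[i], s[i + offset])."""
    return Counter(zip(s, s[offset:]))


def run_lengths(s):
    """Count (character, run length) pairs over maximal runs of equal characters."""
    return Counter((ch, len(list(run))) for ch, run in groupby(s))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


print(cooccurrence("abab"))      # pair counts for adjacent characters
print(run_lengths("aabccc"))     # one run each of 'aa', 'b', 'ccc'
print(cosine(cooccurrence("kitten"), cooccurrence("mitten")))
```

Both statistics depend only on character patterns, which matches the abstract's claim that the features are language-independent.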

12. 【2605.15104】From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Link: https://arxiv.org/abs/2605.15104

Authors: Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN

Categories: Computation and Language (cs.CL)

Keywords: Voice agents increasingly, agents increasingly require, increasingly require reliable, benchmarks remain text-based, prominent tool-calling benchmarks

Abstract:Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

13. 【2605.15102】Improving Multi-turn Dialogue Consistency with Self-Recall Thinking

Link: https://arxiv.org/abs/2605.15102

Authors: Renning Pang, Tian Lan, Leyuan Liu, Xiaoming Huang, Piao Tong, Xiaosong Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language model, Large language, based multi-turn dialogue, multi-turn dialogue systems, undermining both consistency

Abstract: Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high-latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This process yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: Generating and converting it into self-recall chains; (2) Capability Initialization: Training to enable reasoning chains with recall tokens capability; (3) Reasoning Improvement: Refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.

14. 【2605.15081】ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

Link: https://arxiv.org/abs/2605.15081

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: prohibitive computational costs, high-quality text embeddings, Matryoshka Embedding Learning, Matryoshka Representation Learning, Matryoshka Layer Learning

Comments: Accepted by ICML 2026. The data has been released earlier in the preprint [arXiv:2603.19223](https://arxiv.org/abs/2603.19223)

Abstract:The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
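
The storage benefit attributed to Matryoshka Representation Learning comes from being able to keep only a prefix of each embedding. The Python sketch below shows just that inference-side step, truncating and renormalizing, on made-up toy vectors; it does not cover training, nor the MLL and MEL components, and it is not the authors' code.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 4-dimensional embeddings; real models use hundreds of dimensions.
query = [3.0, 4.0, 12.0, 1.0]
doc = [3.0, 4.0, 0.5, 9.0]

for dim in (4, 2):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(dot(q, d), 3))
```

Truncation only preserves quality when the model was trained so that prefixes are meaningful on their own, which is what the Matryoshka objective is meant to provide.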

15. 【2605.15077】Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Link: https://arxiv.org/abs/2605.15077

Authors: Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: modern LLM agents, typically constrained, modern LLM, LLM agents, Function calling

Abstract:Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
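
To make the contrast with synchronous execution concrete, here is a minimal Python `asyncio` sketch of the general futures pattern. The `slow_tool` function, its delays, and the sleep that stands in for decoding are all hypothetical; this is not AsyncFC's implementation, only an illustration of overlapping independent calls and resolving them when their results are needed.

```python
import asyncio
import time


async def slow_tool(name, delay):
    """Hypothetical tool call that takes `delay` seconds to finish."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def run_with_futures():
    # Launch both calls immediately; each task acts as a future.
    search = asyncio.create_task(slow_tool("search", 0.2))
    lookup = asyncio.create_task(slow_tool("lookup", 0.2))
    await asyncio.sleep(0.1)  # stands in for decoding that continues meanwhile
    # Resolve the futures only at the point where the results are needed.
    return [await search, await lookup]


def timed(coro_fn):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


result, elapsed = timed(run_with_futures)
print(result, round(elapsed, 2))
```

Run sequentially, the two calls plus the simulated decoding would take about 0.5 seconds; with the calls launched as tasks, the total is bounded by the slowest one.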

16. 【2605.15071】On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Link: https://arxiv.org/abs/2605.15071

Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: educational platforms, increasingly applied, digital archives, archives to educational, cultural heritage materials

Comments: Project page: [khushboo0012.github.io/tab-vlm-webpage](https://khushboo0012.github.io/tab-vlm-webpage/)

Abstract:Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.

17. 【2605.15041】Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Link: https://arxiv.org/abs/2605.15041

Authors: Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: extends large language, strict structural validity, large language models, reliable execution requires, execution requires balancing

Abstract:Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

18. 【2605.15040】Orchard: An Open-Source Agentic Modeling Framework

Link: https://arxiv.org/abs/2605.15040

Authors: Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandrio Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: autonomous agents capable, solving complex tasks, Agentic modeling aims, Orchard Env, aims to transform

Comments:

Abstract:Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

19. 【2605.15034】AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models

Link: https://arxiv.org/abs/2605.15034

Authors: Vinicius Covas, Jorge Alberto Hidalgo Toledo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

Keywords: Large language models, contexts remains underexplored, Large language, socially structured contexts, structured contexts remains

Comments: 20 pages, 6 figures

Abstract:Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts -- a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation -- from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Delta+24.9%, Delta+24.2%) and the automated AI monitoring condition (Delta+22.2%) produce higher TTR change than audience-framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition -- replacing human with AI observers -- yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
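
Reader's note: the abstract reports changes in TTR (type-token ratio), a standard lexical-diversity measure. The sketch below shows one common way to compute TTR and a percentage change between two conditions. The regex tokenization, the function names, and the toy sentences are assumptions for illustration; the abstract does not specify how the authors tokenize text or derive their Delta values.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique word types divided by total word tokens (0.0 for empty text)."""
    # Assumption: lowercase alphanumeric tokenization; the paper may differ.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def relative_change(baseline: float, condition: float) -> float:
    """Percentage change of `condition` relative to `baseline`."""
    return (condition - baseline) / baseline * 100.0


# Toy messages, not data from the paper.
baseline_ttr = type_token_ratio("the cat sat on the mat")         # 5 types / 6 tokens
monitored_ttr = type_token_ratio("a feline rested upon the rug")  # 6 types / 6 tokens
print(f"TTR change: {relative_change(baseline_ttr, monitored_ttr):+.1f}%")
```

TTR is sensitive to text length, which is one reason to read it alongside message length, as the abstract does.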

20. 【2605.15019】From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

Link: https://arxiv.org/abs/2605.15019

Authors: Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong

Categories: Computation and Language (cs.CL)

Keywords: making failures unverifiable, fine-grained user queries, Multimodal Retrieval-Augmented Generation, systems retrieve evidence, systems retrieve

Comments:

Abstract:Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.

21. 【2605.15016】COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Link: https://arxiv.org/abs/2605.15016

Authors: Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: models empower healthcare, intelligent clinical decision, clinical decision support, empower healthcare, developed rapidly

Comments:

Abstract:As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at this https URL.

22. 【2605.15015】Small, Private Language Models as Teammates for Educational Assessment Design

Link: https://arxiv.org/abs/2605.15015

Authors: Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni Ilkou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Large Language Models, Generative AI increasingly, increasingly supports educational, Large Language, Small Language Models

Comments:

Abstract:Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.

23. 【2605.15012】Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Link: https://arxiv.org/abs/2605.15012

Authors: Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, developing Large Language, Reinforcement Learning, Verifiable Rewards, Language Models

Comments: 25 pages, 11 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.

24. 【2605.15011】The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

Link: https://arxiv.org/abs/2605.15011

Authors: Peter A. Jansen

Categories: Computation and Language (cs.CL)

Keywords: contributions rarely develop, Scientific contributions rarely, Scientific Contribution Graph, Scientific contributions, develop in isolation

Comments: 8 pages, 4 figures

Abstract:Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.
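
Reader's note: the abstract reports 0.48 MAP (mean average precision) for prerequisite prediction. As a rough illustration of the metric only, the sketch below computes the textbook form of MAP over ranked candidate lists. The function names and the toy contribution names are hypothetical, and the paper's exact evaluation protocol (including its temporal filtering) is not reproduced here.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items.

    Assumes `ranked` contains no duplicate items.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of average precision over (ranked, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Toy example: two target contributions with predicted prerequisite rankings.
queries = [
    (["attention", "word2vec", "bpe"], {"attention", "bpe"}),  # AP = (1/1 + 2/3) / 2
    (["lstm", "attention"], {"attention"}),                    # AP = 1/2
]
print(round(mean_average_precision(queries), 4))
```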

25. 【2605.15000】Quantifying and Mitigating Premature Closure in Frontier LLMs

Link: https://arxiv.org/abs/2605.15000

Authors: Rebecca Handler, Suhana Bedi, Nigam Shah

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Premature closure, large language models, conclusion before sufficient, sufficient information, recognized contributor

Comments: 14 pages, 3 figures, 1 table

Abstract:Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.

26. 【2605.14995】Explainable Detection of Depression Status Shifts from User Digital Traces

Link: https://arxiv.org/abs/2605.14995

Authors: Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: online interactions, inherently timestamped, reflect aspects, user digital traces, social media posts

Comments:

Abstract:Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.

27. 【2605.14978】Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

Link: https://arxiv.org/abs/2605.14978

Authors: Jie Jiang, Xing Sun

Categories: Computation and Language (cs.CL)

Keywords: accelerates LLM inference, decoding accelerates LLM, accelerates LLM, LLM inference, lightweight draft model

Comments:

Abstract:Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
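
Reader's note: the abstract's point that "an early mismatch truncates the accepted prefix" can be seen in a simplified greedy verification loop. The sketch below assumes exact-match verification against the target model's own tokens; practical speculative decoding often uses a probabilistic acceptance rule instead, and PPOW's rewards and windowing are not modeled. All names and token ids are hypothetical.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count draft tokens accepted before the first mismatch with the target."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # everything after the first mismatch is discarded
        accepted += 1
    return accepted


def mean_acceptance_length(windows):
    """Average accepted prefix length over (draft, target) window pairs."""
    if not windows:
        return 0.0
    return sum(accepted_prefix_length(d, t) for d, t in windows) / len(windows)


# Toy token ids: the drafts match the target except at one position each.
late_miss = ([5, 9, 2, 7], [5, 9, 2, 8])   # mismatch at the last position
early_miss = ([5, 1, 2, 7], [5, 9, 2, 7])  # mismatch at the second position
print(accepted_prefix_length(*late_miss), accepted_prefix_length(*early_miss))
```

Both drafts get three of four tokens right, yet the early mismatch keeps only one token, which is why a window-level objective can differ from token-level accuracy.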

28. 【2605.14928】Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Link: https://arxiv.org/abs/2605.14928

Authors: Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong

Categories: Computation and Language (cs.CL)

Keywords: remains largely unexplored, achieved impressive results, Recent advances, procedure question answering, vision-language models

Comments:

Abstract:Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, achieving up to 13% absolute improvement over standard baselines.

29. 【2605.14890】Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

Link: https://arxiv.org/abs/2605.14890

Authors: Volodymyr Ovcharov

Categories: Computation and Language (cs.CL)

Keywords: tokenize Ukrainian legal, Ukrainian legal text, systematic comparison exists, Foundation models tokenize, models tokenize Ukrainian

Comments: 22 pages, 21 tables, 3 figures

Abstract:Foundation models tokenize Ukrainian legal text with vastly different efficiency, yet no systematic comparison exists for this domain. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Three findings emerge. (1) Tokenizer fertility varies 1.6x: Qwen3 models consume 60% more tokens than Llama-family models on identical input, directly reducing API cost. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (675B total, 41B active) -- a model with 5.6x more total parameters and 3.4x more active parameters per token -- at one-third the API cost. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages.
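
Reader's note: tokenizer fertility is commonly defined as the average number of tokens produced per word. The sketch below uses that common definition with two toy tokenizers; the abstract does not state the paper's exact formula, and `whitespace_tokenizer`, `chunk_tokenizer`, and the sample text are hypothetical stand-ins rather than any real model's tokenizer.

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace-delimited word over a list of texts."""
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("no words in input texts")
    total_tokens = sum(len(tokenize(text)) for text in texts)
    return total_tokens / total_words


def whitespace_tokenizer(text):
    """Toy tokenizer: one token per word (fertility 1.0 by construction)."""
    return text.split()


def chunk_tokenizer(text, width=3):
    """Toy tokenizer: split every word into chunks of at most `width` characters."""
    return [word[i:i + width] for word in text.split() for i in range(0, len(word), width)]


texts = ["court decision"]  # "court" gives 2 chunks, "decision" gives 3 chunks
low = fertility(texts, whitespace_tokenizer)
high = fertility(texts, chunk_tokenizer)
print(low, high, high / low)
```

On identical input, the ratio of two fertilities is also the ratio of token counts, which is how a 1.6x fertility gap corresponds to 60% more tokens.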

30. 【2605.14865】Holistic Evaluation and Failure Diagnosis of AI Agents

Link: https://arxiv.org/abs/2605.14865

Authors: Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: complex multi-step processes, connect failure types, execute complex multi-step, process-level approaches struggle, agents execute complex

Comments:

Abstract:AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

31. 【2605.14816】Conversion of Lexicon-Grammar tables to LMF. Application to French

Link: https://arxiv.org/abs/2605.14816

Authors: Eric Laporte, Elsa Tolone, Mathieu Constant

Categories: Computation and Language (cs.CL)

Keywords: Lexical Markup Framework, Markup Framework, Lexical Markup, French verbs, tables for French

Comments:

Abstract:We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.

32. 【2605.14790】Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

Link: https://arxiv.org/abs/2605.14790

Authors: Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: idea generation, Research idea generation, innovation-driving step, idea, generation

Comments:

Abstract:Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly condition LLMs on eliciting idea generation through static retrieval of relevant literature or complex prompt engineering, without discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as supervision signal for LLM-based idea generation. We hope that this reduces the barrier for citation evolution graphs as a supervision, accelerating automated scientific innovation.

33. 【2605.14787】Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Link: https://arxiv.org/abs/2605.14787

Authors: Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan, Wai-Chung Kwan, Claudio Pomo, Alessandro Suglia, Dietmar Jannach, Tommaso Di Noia, Pasquale Minervini

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Composed Image Retrieval, Composed Image, reference image, target image satisfying, textual modification

Comments:

Abstract:Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

34. 【2605.14766】Streaming Speech-to-Text Translation with a SpeechLLM

Link: https://arxiv.org/abs/2605.14766

Authors: Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Keywords: consists of separate, separate modules, translates speech, speech recognition, speech

Comments: 9 pages of main text; 24 pages in total

Abstract:Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.

35. 【2605.14765】Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

Link: https://arxiv.org/abs/2605.14765

Authors: Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani, Leili Javidpour, Mahdieh Soleymani Baghshah

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: presents significant challenges, primarily on Western, models trained primarily, modal systems, Western music

Comments: 9 pages, 2 figures, 3 tables

Abstract:Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance by utilizing subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that more align with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.

36. 【2605.14749】Non-linear Interventions on Large Language Models

Link: https://arxiv.org/abs/2605.14749

Authors: Sangwoo Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Linear Representation Hypothesis, large language models, representative and widely, understanding the internal, large language

Comments:

Abstract:Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.

37. 【2605.14747】Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Link: https://arxiv.org/abs/2605.14747

Authors: Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: graphical user interface, multimodal large language, large language models, driven growing interest, generalization remains constrained

Comments: Accepted at ICML 2026

Abstract:Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

38. 【2605.14744】Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems

Link: https://arxiv.org/abs/2605.14744

Authors: José Manuel de la Chica Rodríguez, Carlos Martí-González

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large language models, Large language, regulated financial workflows, creating a principal, agent failure

Abstract:Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal-agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC 0.43 to 0.88. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
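For readers less familiar with the task-accuracy figure quoted in this abstract, MCC presumably refers to the Matthews correlation coefficient, a standard binary-classification score ranging from -1 to 1. The sketch below computes it from confusion-matrix counts. The function name `mcc` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
print(mcc(tp=10, tn=10, fp=0, fn=0))   # perfect agreement
print(mcc(tp=5, tn=5, fp=5, fn=5))     # no better than chance
```

Unlike plain accuracy, MCC only approaches 1 when both classes are predicted well, which is one reason it is often preferred for imbalanced decision data.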

39. 【2605.14723】Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

Link: https://arxiv.org/abs/2605.14723

Authors: Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: ICU requires sequential, ICU requires, evolving patient physiology, requires sequential treatment, Clinical World Model

Abstract:Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid-vasopressor interventions, and follows a propose-simulate-refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

40. 【2605.14712】IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Link: https://arxiv.org/abs/2605.14712

Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robot imitation data, human demonstrators act, Robot imitation, similar visual-language observations, task phases

Comments: Code can be found in [this https URL](https://github.com/ZGC-EmbodyAI/IntentVLA)

Abstract:Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

41. 【2605.14679】AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents

Link: https://arxiv.org/abs/2605.14679

Authors: Vicent Briva-Iglesias, María Ferre-Fernández

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: interpretive materials globally, increasingly disseminate research, Cultural heritage institutions, Cultural heritage, heritage institutions increasingly

Abstract:Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0-100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
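To make "glossary-augmented prompting via term-pair retrieval" more concrete, here is a minimal sketch of one way such a prompt could be assembled. It is not the authors' pipeline: the substring-based retrieval, the helper names `retrieve_term_pairs` and `build_prompt`, the prompt wording, and the glossary entries are all assumptions made for illustration.

```python
def retrieve_term_pairs(source: str, glossary: dict[str, str]) -> list[tuple[str, str]]:
    """Return (Spanish, English) glossary pairs whose Spanish term occurs in the source."""
    lowered = source.lower()
    return [(es, en) for es, en in glossary.items() if es.lower() in lowered]


def build_prompt(source: str, glossary: dict[str, str]) -> str:
    """Assemble a translation prompt that lists only the retrieved term pairs."""
    pairs = retrieve_term_pairs(source, glossary)
    term_lines = "\n".join(f"- {es} -> {en}" for es, en in pairs)
    return (
        "Translate the Spanish text into English.\n"
        "Use these term translations consistently:\n"
        f"{term_lines}\n\n"
        f"Text: {source}"
    )


# Hypothetical glossary entries, for illustration only.
glossary = {"arte rupestre": "rock art", "abrigo": "rock shelter"}
print(build_prompt("El arte rupestre levantino es muy conocido.", glossary))
```

Retrieving only the matching pairs keeps the prompt short even when the institutional glossary is large, which fits the low-overhead framing of the abstract.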

42. 【2605.14665】Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Link: https://arxiv.org/abs/2605.14665

Authors: Joy Bose

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: semantic similarity search, similarity search, semantic similarity, Verifier Agent, Supreme Court

Comments: 20 pages, 8 figures, 4 tables

Abstract:Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.
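The acceptance rule described in this abstract (an answer is kept only if a valid supporting path can be traced through the graph) can be sketched with a plain breadth-first search. This is a toy illustration under stated assumptions, not the Verifier Agent itself: the adjacency-dict graph, the node names, and the functions `reachable` and `accept_answer` are hypothetical, and no FalkorDB API is used.

```python
from collections import deque


def reachable(graph: dict[str, list[str]], start: str, goal: str) -> bool:
    """Breadth-first search: True if goal can be reached from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def accept_answer(graph, issue, cited, conclusion) -> bool:
    """Reject answers citing unknown nodes or lacking an issue-to-conclusion path."""
    known = set(graph) | {n for targets in graph.values() for n in targets}
    if any(c not in known for c in cited):
        return False  # a fabricated citation has no node in the graph
    return reachable(graph, issue, conclusion)


# Hypothetical IRAC-style toy graph.
toy_graph = {
    "issue:i1": ["rule:r1"],
    "rule:r1": ["analysis:a1"],
    "precedent:p1": ["analysis:a1"],
    "analysis:a1": ["conclusion:c1"],
}
print(accept_answer(toy_graph, "issue:i1", ["precedent:p1"], "conclusion:c1"))
print(accept_answer(toy_graph, "issue:i1", ["precedent:fabricated"], "conclusion:c1"))
```

A production check would also need to validate edge types and the temporal validity of statutes; the sketch only covers node existence and reachability.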

43. 【2605.14621】Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Link: https://arxiv.org/abs/2605.14621

Authors: Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large vision-language models, language priors dominate, priors dominate weak, Large vision-language, vision-language models

Abstract:Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
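The token-level contrast step can be illustrated with a small sketch. The abstract does not give SIRA's scoring rule, so the code below assumes the combination commonly used in contrastive decoding, `(1 + alpha) * full - alpha * reference`; the function name `contrast_logits`, the `alpha` value, and the toy logits are all hypothetical.

```python
def contrast_logits(full: list[float], reference: list[float], alpha: float = 1.0) -> list[float]:
    """Boost tokens whose score depends on the full visual pathway.

    `full` holds logits from the normal forward pass; `reference` holds logits
    from the branch with late attention to image tokens masked.
    """
    if len(full) != len(reference):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, reference)]


# Toy example: token 0 stays strong without late visual access (language prior),
# token 1 is only strong when the image is visible.
full = [2.0, 1.5]
reference = [2.0, 0.0]
adjusted = contrast_logits(full, reference, alpha=1.0)
print(adjusted)
```

In this toy case the prior-driven token keeps its original score while the visually grounded token is raised above it, which mirrors the suppression behaviour the abstract describes.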

44. 【2605.14600】SciPaths: Forecasting Pathways to Scientific Discovery

Link: https://arxiv.org/abs/2605.14600

Authors: Eric Chamoun, Yizhou Chi, Yulong Chen, Rui Cao, Zifeng Ding, Michalis Korakakis, Andreas Vlachos

Categories: Computation and Language (cs.CL)

Keywords: Scientific progress depends, benchmarks largely focus, citation prediction, progress depends, enabling contributions

Abstract:Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.

45. 【2605.14589】EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Link: https://arxiv.org/abs/2605.14589

Authors: Han Tian, Luxuan Chen, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Jinman Zhao, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin

Categories: Computation and Language (cs.CL)

Keywords: incurring quadratic memory, large language models, language models typically, make long-context adaptation, long-context adaptation expensive

Abstract:Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at this https URL.
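The two-segment construction can be pictured as a position-index assignment. The sketch below is one plausible reading of the abstract, not the released code: it assumes the terminal prompt occupies the last `prompt_len` positions before the target length, and the function name `two_segment_position_ids` and the toy sizes are hypothetical.

```python
def two_segment_position_ids(context_len: int, prompt_len: int, target_len: int) -> list[int]:
    """Position ids for a short context followed by a far-away terminal prompt.

    The first segment keeps its natural positions 0..context_len-1; the second
    segment is placed at the end of the target window (an assumption).
    """
    if context_len + prompt_len > target_len:
        raise ValueError("segments do not fit inside the target window")
    first = list(range(context_len))
    second = list(range(target_len - prompt_len, target_len))
    return first + second


# Toy sizes: 4 context tokens, 2 prompt tokens, target window of 16 positions.
ids = two_segment_position_ids(context_len=4, prompt_len=2, target_len=16)
print(ids)
print(ids[-1] - ids[0])  # largest relative distance seen in this short sequence
```

The physical sequence here has only six tokens, yet the relative distances between the two segments span nearly the whole target window, which is the effect the abstract attributes to terminal anchoring.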

46. 【2605.14570】Uncertainty Quantification for Large Language Diffusion Models

Link: https://arxiv.org/abs/2605.14570

Authors: Artem Vazhentsev, Vladislav Smirnov, David Li, Maxim Panov, Timothy Baldwin, Artem Shelmanov

Categories: Computation and Language (cs.CL)

Keywords: Large Language Diffusion, Large Language, Language Diffusion Models, offering faster inference, offering faster

Abstract:Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.

47. 【2605.14568】Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

Link: https://arxiv.org/abs/2605.14568

Authors: Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Context, Behaviour-Driven Development, step subsequences, Uniform Manifold Approximation, within-file Background

Comments: 30 pages, 12 figures and tables, 58 references. Under review at Software Quality Journal (Springer). Reproduction package at [this https URL](https://github.com/amughalbscs16/cukereuse_subscenarios_release) (Apache-2.0). Upstream cukereuse corpus at [this https URL](https://doi.org/10.5281/zenodo.19754359)

Abstract:Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.

48. 【2605.14563】Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

Link: https://arxiv.org/abs/2605.14563

Authors: Suyoung Bae, Jaehoon Lee, Changkyu Choi, YunSeok Choi, Jee-Hyong Lee

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: navigate large codebases, Automated code documentation, Automated code, coding agents rely, providing the contextual

Abstract:Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within a single, integrated context spanning the entire repository. It combines two components: (i) Dependency-Aware Traversal Guiding that predetermines a traversal order respecting dependency and granularity hierarchies; (ii) Memory-Guided Agentic Interaction, in which the agent interacts with RepoMemory, a shared memory accumulating prior work traces through read, write, and verify operations. Through an in-depth multi-criteria evaluation, MemDocAgent achieves the best performance over both open and closed-source baselines and demonstrates practical applicability in real software development workflows.

49. 【2605.14558】Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

Link: https://arxiv.org/abs/2605.14558

Authors: Langzhou He, Junyou Zhu, Yue Zhou, Zhengyao Gu, Junhua Liu, Wei-Chieh Huang, Henry Peng Zou, David Wipf, Philip S. Yu, Qitian Wu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Agentic reinforcement learning, reinforcement learning trains, learning trains large, trains large language, Agentic reinforcement

Comments: Preprint

Abstract:Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
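The basic idea of downweighting reasoning tokens relative to action tokens can be sketched as a per-token weight vector applied to a token-level loss. This is only the simplest reading of the abstract: the energy-based redistribution step is omitted, and the names `token_weights` and `weighted_mean_loss`, the `reasoning_weight` value, and the toy numbers are assumptions rather than details from the paper.

```python
def token_weights(action_mask: list[int], reasoning_weight: float = 0.1) -> list[float]:
    """Weight 1.0 for action tokens (mask == 1), a smaller weight for reasoning tokens."""
    return [1.0 if is_action else reasoning_weight for is_action in action_mask]


def weighted_mean_loss(token_losses: list[float], weights: list[float]) -> float:
    """Weighted average of per-token losses, normalised by the total weight."""
    total = sum(weights)
    return sum(l * w for l, w in zip(token_losses, weights)) / total


# Toy trajectory: two reasoning tokens followed by one action token.
mask = [0, 0, 1]
losses = [1.0, 1.0, 4.0]
weights = token_weights(mask, reasoning_weight=0.5)
print(weights)
print(weighted_mean_loss(losses, weights))
```

With uniform weights the toy loss would average to 2.0; the reweighted value is pulled toward the action token's loss, so the gradient signal concentrates on the short environment-facing part of the trajectory.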

50. 【2605.14539】Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Link: https://arxiv.org/abs/2605.14539

Authors: Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

Categories: Computation and Language (cs.CL)

Keywords: Verifiable Rewards, Reinforcement Learning, large language models, paradigm for improving, capabilities of large

Comments: Work in progress

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.

51. 【2605.14531】Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

Link: https://arxiv.org/abs/2605.14531

Authors: ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Categories: Computation and Language (cs.CL)

Keywords: Irreversibility Error Propagation, adjoint state vanishing, Irreversibility Error, Error Propagation, Optimization Tractability

Abstract:This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

52. 【2605.14517】Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

Link: https://arxiv.org/abs/2605.14517

Authors: GAng Peng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: user specific intent, user request, user specific, intent fidelity evaluation, evaluation scores capture

Comments: Preprint. 30 tasks, 3 languages, 6 LLMs, 2,880 outputs; includes human evaluation and structured prompt ablation

Abstract:Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.

53. 【2605.14498】GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Link: https://arxiv.org/abs/2605.14498

Authors: Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, agents increasingly serve, Large Language, Language Model, memory systems

Abstract:Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where the per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.

54. 【2605.14480】Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

Link: https://arxiv.org/abs/2605.14480

Authors: Ji-eun Kim

Subjects: Computation and Language (cs.CL)

Keywords: underlying Huìtóngguǎnxì Huáyíyìyǔ, principles underlying Huìtóngguǎnxì, Huìtóngguǎnxì Huáyíyìyǔ, underlying Huìtóngguǎnxì, Ming government

Comments: 47 pages; 1 figure; 40 tables; SLE2019; under review

Click to view abstract

Abstract:Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.

Submission history: From Ji-Eun Kim, [v1] Thu, 14 May 2026 07:21:18 UTC (441 KB)

55. 【2605.14478】When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

Link: https://arxiv.org/abs/2605.14478

Authors: Haojun Weng,Qianqian Yang,Hao Fu,Haobin Pan,Xinwei Lv

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-augmented code generation, Retrieval-augmented code, code generation relies, code

Comments: 31 pages, 2 tables. Submitted to Information and Software Technology (Elsevier)

Click to view abstract

Abstract:Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
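The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that the Jaccard overlap is computed between the two models' sets of stale-triggering samples under stale-only retrieval, and that the percentage-point increases are measured against the current-only stale rate. Both are readings of the abstract, not statements from the paper, and the variable names are hypothetical.

```python
from fractions import Fraction

N = 17            # samples in the curated set
STALE_QWEN = 15   # stale-only samples with stale references (Qwen2.5-Coder-7B-Instruct)
STALE_GPT = 13    # stale-only samples with stale references (gpt-4.1-mini)

# Stale-reference rates under stale-only retrieval, in percent.
rate_qwen = 100 * STALE_QWEN / N
rate_gpt = 100 * STALE_GPT / N

# Jaccard = |A ∩ B| / |A ∪ B| with |A ∪ B| = |A| + |B| - |A ∩ B|.
# The intersection must be large enough that the union fits in N samples.
min_shared = max(0, STALE_QWEN + STALE_GPT - N)
max_shared = min(STALE_QWEN, STALE_GPT)
shared = [
    k
    for k in range(min_shared, max_shared + 1)
    if Fraction(k, STALE_QWEN + STALE_GPT - k) == Fraction(3, 4)
]

print(round(rate_qwen, 1), round(rate_gpt, 1), shared)
```

Under these assumptions the rates round to 88.2% and 76.5%, which equal the stated percentage-point increases. That is consistent with a current-only stale-reference rate of zero. The only intersection size that yields a 75.0% Jaccard overlap is 12 shared samples out of a union of 16.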

Submission history: From Haojun Weng, [v1] Thu, 14 May 2026 07:18:30 UTC (22 KB)

57. 【2605.14454】LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Link: https://arxiv.org/abs/2605.14454

Authors: Minbeom Kim,Lesly Miculicich,Bhavana Dalvi Mishra,Mihir Parmar,Phillip Wallis,Bharath Chandrasekhar,Kyomin Jung,Tomas Pfister,Long T. Le

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: read private data, execute multi-step workflows, concrete deployment harms, call tools, private data

Comments: 27 pages, 3 figures

Click to view abstract

Abstract:As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
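To illustrate why gating on a posterior lower bound differs from gating on empirical accuracy, here is a minimal sketch. It assumes a Beta(1, 1) prior over a rule's correctness and a Monte Carlo estimate of the 5% posterior quantile. The abstract does not specify LiSA's actual posterior or threshold, so the function name, prior, quantile level, and threshold are all hypothetical.

```python
import random


def posterior_lower_bound(successes, failures, level=0.05, n_samples=20000, seed=0):
    """Estimate the `level` quantile of a Beta(1 + s, 1 + f) posterior by sampling."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    draws = sorted(
        rng.betavariate(1 + successes, 1 + failures) for _ in range(n_samples)
    )
    return draws[int(level * n_samples)]


THRESHOLD = 0.8  # hypothetical gate: reuse a memory rule only above this bound

# Both rules have 100% empirical accuracy but different amounts of evidence.
lb_few = posterior_lower_bound(successes=2, failures=0)
lb_many = posterior_lower_bound(successes=20, failures=0)

reuse_few = lb_few >= THRESHOLD
reuse_many = lb_many >= THRESHOLD
print(round(lb_few, 2), reuse_few, round(lb_many, 2), reuse_many)
```

With only two confirmations the lower bound stays well below the gate, while twenty confirmations clear it. In this sketch, reuse scales with accumulated evidence even when the empirical accuracy is identical.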

58. 【2605.14449】When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

Link: https://arxiv.org/abs/2605.14449

Authors: Siyang Yao,Erhu Feng,Yubin Xia

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, requires balancing accuracy, balancing accuracy, language models, requires balancing

Comments:

Click to view abstract

Abstract:Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
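The core step described in the abstract, projecting the question-aligned direction out of the answer representation, can be sketched as a plain vector projection. This assumes a single pooled vector for each of question and answer. How the paper pools hidden states, and which layers and neurons it selects, is not covered here, and the function name `project_out` and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(answer, question):
    """Return the component of `answer` orthogonal to `question`."""
    qq = dot(question, question)
    if qq == 0:
        return list(answer)  # nothing to remove for a zero question vector
    coef = dot(answer, question) / qq
    return [a - coef * q for a, q in zip(answer, question)]


# Toy 3-dimensional stand-ins for pooled hidden states.
question_vec = [1.0, 2.0, 2.0]
answer_vec = [3.0, 0.0, 3.0]

residual = project_out(answer_vec, question_vec)
print(residual, dot(residual, question_vec))
```

In this toy case the residual is `[2.0, -2.0, 1.0]` and its inner product with the question vector is zero, so whatever a downstream probe reads from it is, by construction, not explained by the question direction.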

59. 【2605.14448】Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Link: https://arxiv.org/abs/2605.14448

Authors: Longxiang Zhang,Weilong Dai,Guanghao Zhang,Hao Jiang,Pipei Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Multimodal large language, large language models, large language, Multimodal large, reasoning

Comments: 30 pages, preprint

Click to view abstract

Abstract:Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

60. 【2605.14427】A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Link: https://arxiv.org/abs/2605.14427

Authors: Sunil Kumar Kopparapu

Subjects: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: automatic speech recognition, hybrid automatic speech, vocabulary size, vocabulary size hyper-parameter, ASR systems

Comments: 8 pages, is an extension of the paper S. K. Kopparapu and A. Panda, A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system, in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024

Click to view abstract

Abstract:In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the first and second derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves ASR performance. The main contribution of this paper is a formalized approach to identifying the vocabulary size best suited for training an end-to-end ASR system.
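The derivative-test idea can be sketched on toy data. The example below assumes a quadratic cost curve, which is a choice made here for illustration since the abstract does not name the curve family. The (vocabulary size, cost) pairs are hypothetical, and `fit_quadratic` is a hypothetical helper that solves the least-squares normal equations.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented system for the unknowns (c, b, a).
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [mr - f * mi for mr, mi in zip(m[r], m[i])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


# Hypothetical data: vocabulary size in thousands of tokens vs. a cost value.
vocab_k = [1, 2, 3, 4, 5]
cost = [12.0, 10.5, 10.0, 10.5, 12.0]

a, b, c = fit_quadratic(vocab_k, cost)
v_star = -b / (2 * a)      # first-derivative test: 2*a*v + b = 0
second_derivative = 2 * a  # second-derivative test: positive means a minimum
print(round(v_star, 3), second_derivative > 0)
```

For a richer curve family the same two tests apply: solve for the zeros of the fitted curve's first derivative, then keep those where the second derivative is positive.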

61. 【2605.14415】SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Link: https://arxiv.org/abs/2605.14415

Authors: Man Ho Lam,Chaozheng Wang,Hange Liu,Jingyu Xiao,Haau-sing Li,Jen-tse Huang,Terry Yue Zhuo,Michael R. Lyu

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: isolated issue resolution, Coding agents powered, large language models, perform realistic software, software maintenance tasks

Comments:

Click to view abstract

Abstract:Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.

62. 【2605.14404】Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

Link: https://arxiv.org/abs/2605.14404

Authors: Kyomin Hwang,Hyeonjin Kim,Sangyeon Cho,Nojun Kwak

Subjects: Computation and Language (cs.CL)

Keywords: pose privacy risks, sensitive personally identifiable, personally identifiable information, Knowledge Separability Score, Knowledge Persistence Score

Comments:

Click to view abstract

Abstract:While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.

63. 【2605.14401】Agentic Recommender System with Hierarchical Belief-State Memory

Link: https://arxiv.org/abs/2605.14401

Authors: Xiang Shen,Yuhang Zhou,Yifan Wu,Zhuokai Zhao,Siyu Lin,Lei Huang,Qianqian Zhong,Lizhu Zhang,Benyu Zhang,Xiangjun Fan,Hong Yan

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Memory-augmented LLM agents, existing approaches universally, approaches universally adopt, universally adopt flat, advanced personalized recommendation

Comments: 4 figures, 8 tables

Click to view abstract

Abstract:Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines, with further gains from agentic scheduling in evolving settings.

64. 【2605.14389】Nexus : An Agentic Framework for Time Series Forecasting

Link: https://arxiv.org/abs/2605.14389

Authors: Sarkar Snigdha Sarathi Das,Palash Goyal,Mihir Parmar,Nanyun Peng,Vishy Tirumalashetty,Chun-Liang Li,Rui Zhang,Jinsung Yoon,Tomas Pfister

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Time Series Foundation, Series Foundation Models, Time series, specialized Time Series, Time series forecasting

Comments: 30 Pages, 3 figures, 5 Tables

Click to view abstract

Abstract:Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond sequence modeling alone.

65. 【2605.14381】NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

Link: https://arxiv.org/abs/2605.14381

Authors: Qazi Mamunur Rashid,Xuan Yang,Zhengzhe Yang,Yanzhou Pan,Erin van Liemt,Darlene Neal,Kshitij Pancholi,Jamila Smith-Loud

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Recent advancements, facilitate large-scale synthetic, large-scale synthetic data, synthetic data generation, advancements in generative

Comments:

Click to view abstract

Abstract:Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (this https URL).

66. 【2605.14380】Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

Link: https://arxiv.org/abs/2605.14380

Authors: Hoang-Thuy-Duong Vu,Quoc-Cuong Pham,Huy-Hieu Pham

Subjects: Computation and Language (cs.CL)

Keywords: unconscious cognitive processes, emotional distress, unconscious cognitive, cognitive processes, processes that modulate

Comments:

Click to view abstract

Abstract:Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: this https URL.

67. 【2605.14368】Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Link: https://arxiv.org/abs/2605.14368

Authors: Injin Kong,Hyoungjoon Lee,Yohan Jo

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: spaces poorly suited, lag behind autoregressive, applied in spaces, spaces poorly, poorly suited

Comments:

Click to view abstract

Abstract:Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.

68. 【2605.14366】Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Link: https://arxiv.org/abs/2605.14366

Authors: Zeli Su,Ziyin Zhang,Zhou Liu,Xuexian Song,Zhankai Xu,Longfei Zheng,Xiaolu Zhang,Rong Fu,Guixian Xu,Wentao Zhang

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Extending large language, Extending large, Relative Policy Optimization, Group Relative Policy, cost of catastrophic

Comments: ACL 2026 Findings

Click to view abstract

Abstract:Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
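The combination of embedding-level rewards with GRPO can be sketched in two steps: score each sampled output by its cosine similarity to a reference embedding, then normalize the rewards within the group. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. The 2-dimensional vectors are toy stand-ins for sentence embeddings, and the paper's actual embedding model and reward shaping are not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy embeddings: one reference translation and three sampled candidates.
reference = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rewards = [cosine(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
print([round(r, 3) for r in rewards], [round(a, 3) for a in advantages])
```

A candidate that preserves the meaning with different wording can still earn a high reward here, which is the contrast with token-level imitation that the abstract draws.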

69. 【2605.14360】A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring

Link: https://arxiv.org/abs/2605.14360

Authors: Tamunotonye Harry,Johanna Hidalgo,Matthew Price,Yuanyuan Feng,Kathryn Stanton,Connie Tompkins,Peter Sheridan Dodds,Mikaela Irene Fudolig,Laura Bloomfield,Christopher Danforth

Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: Wearable devices capture, Wearable devices, limiting passive sensing, passive sensing utility, psychological context shaping

Comments: Submitted to ACM IMWUT

Click to view abstract

Abstract:Wearable devices capture physiological and behavioral data with increasing fidelity, but the psychological context shaping these outcomes is difficult to recover from sensor data alone, limiting passive sensing utility for digital health. We examined whether ultra-brief naturalistic concern text could serve as a scalable complement to passive sensing. In a year-long study of 458 university students (3,610 person-waves) tracked with Oura rings, participants responded bimonthly to an open-ended prompt about what concerned them most; responses had a median length of three words. We compared dictionary-based, general pretrained, and domain-adapted NLP approaches using within-person mixed-effects models across nine sleep and physical activity outcomes. Weeks dominated by academic concern framing were associated with lower physical activity; weeks characterized by emotional exhaustion language were associated with poorer sleep quality and lower heart rate variability. General pretrained embeddings outperformed domain-adapted models for most outcomes, with domain adaptation showing relative advantage for autonomic outcomes. Zero-shot classification of concern topics produced no significant associations, while affective dimensions across all three methods were consistently associated with outcomes, indicating emotional register rather than topical content carries the signal. These findings offer design guidance: ultra-brief affective prompts enrich the psychological interpretability of passive physiological data at minimal burden.

70. 【2605.14355】Herculean: An Agentic Benchmark for Financial Intelligence

Link: https://arxiv.org/abs/2605.14355

Authors: Xueqing Peng,Zhuohan Xie,Yupeng Cao,Haohang Li,Lingfei Qian,Yan Wang,Vincent Jim Zhang,Huan He,Xuguang Ai,Linhai Ma,Ruoyu Xiang,Yueru He,Yi Han,Shuyao Wang,Yuqing Guo,Mingyang Jiang,Yilun Zhao,Youzhong Dong,Xiaoyu Wang,Yankai Chen,Ye Yuan,Qiyuan Zhang,Fuyuan Lyu,Haolun Wu,Yonghan Yang,Zichen Zhao,Yuyang Dai,Fan Zhang,Rania Elbadry,Ayesha Gull,Muhammad Usman Safder,Nuo Chen,Fengbin Zhu,Tianshi Cai,Zimu Wang,Polydoros Giannouris,Yuechen Jiang,Zhiwei Liu,Mohsinul Kabir,Yuyan Wang,Yixiang Zheng,Yangyang Yu,Weijin Liu,Wenbo Cao,Anke Xu,Peng Lu,Jerry Huang,Fengran Mo,Mingquan Lin,Prayag Tiwari,Yijia Zhao,Victor Gutierrez Basulto,Xiao-Yang Liu,Kaleb E Smith,Jiahuan Pei,Arman Cohan,Jimin Huang,Yuehua Tang,Alejandro Lopez-Lira,Xi Chen,Xue Liu,Junichi Tsujii,Jian-Yun Nie,Sophia Ananiadou

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: solve isolated well-defined, well-defined financial tasks, financial professional work, isolated well-defined financial, professional work

Comments:

Click to view abstract

Abstract:As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

71. 【2605.14354】LLM-based Detection of Manipulative Political Narratives

Link: https://arxiv.org/abs/2605.14354

Authors: Sinclair Schneider,Florian Steuber,Gabi Dreo Rodosek

Categories: Computation and Language (cs.CL)

Keywords: structuring manipulative political, manipulative political narratives, computational framework, framework for detecting, detecting and structuring

Comments: This paper has been submitted to the upcoming 18th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2026)

Click to view abstract

Abstract:We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important as political discussions have shifted to social media. One of the primary challenges is differentiating between manipulative political narratives and legitimate critiques. Some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we filter manipulative posts beforehand using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. This approach, applied to over 1.2 million social media posts, effectively identified 41 distinct manipulative narrative clusters by integrating prompt-based filtering with unsupervised clustering.

72. 【2605.14352】Ideology Prediction of German Political Texts

Link: https://arxiv.org/abs/2605.14352

Authors: Sinclair Schneider,Florian Steuber,Joao A. G. Schneider,Gabi Dreo Rodosek

Categories: Computation and Language (cs.CL)

Keywords: nation ongoing development, Elections represent, ongoing development, represent a crucial, crucial milestone

Comments: This paper has been accepted for the upcoming 20th International AAAI Conference on Web and Social Media (ICWSM 2026)

Click to view abstract

Abstract:Elections represent a crucial milestone in a nation's ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Such a task can only be achieved with multiclass classifiers, provided that the desired orientation is incorporated within one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two distinct corpora for training and two for testing. DeBERTa-large achieved the highest in-domain F1 score (F1 = 0.844) and also performed best on the X (Twitter) out-of-domain test (ACC = 0.864). On the newspaper out-of-domain test, Gemma2-2B excelled (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.

73. 【2605.14323】Dynamic Latent Routing

Link: https://arxiv.org/abs/2605.14323

Authors: Fangyuan Yu,Xin Su,Amir Abdullah

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Markov Decision Processes, Decision Processes, Markov Decision, time-varying reward functions, General Dijkstra Search

Comments:

Click to view abstract

Abstract:We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

74. 【2605.14305】Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

Link: https://arxiv.org/abs/2605.14305

Authors: Xun Fang,Yunchen Li,Hang Yuan,Zhou Yu

Categories: Computation and Language (cs.CL)

Keywords: Discrete diffusion language, Diffusion Language Modeling, diffusion language models, Discrete diffusion, introduce factorization errors

Comments:

Click to view abstract

Abstract:Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.

75. 【2605.14292】Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor

Link: https://arxiv.org/abs/2605.14292

Authors: Libo Sun,Po-wei Harn,Peixiong He,Xiao Qin

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: crowded design space, design space spanning, spanning cache representation, space spanning cache, KV-cache compression

Comments: 12 pages, 2 figures, 3 tables. Code and data: [this https URL](https://github.com/libophd/minimal-kv-retention)

Click to view abstract

Abstract:KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $\alpha$, a one-function modification to the TriAttention retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $\lambda$. A pre-registered protocol tunes $\lambda$ on a frozen development split and confirms on a disjoint held-out split; with $\lambda = 0.5$, $\alpha$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
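
The contrast between plain top-k and greedy selection under a redundancy penalty can be illustrated with a toy sketch. This is not the paper's scorer: the scores, the 2-D value vectors, the use of cosine similarity as the redundancy measure, and the function names `top_k` and `greedy_diverse` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(scores, k):
    """Baseline: keep the k highest-scoring indices."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def greedy_diverse(scores, values, k, lam):
    """Greedy selection: each pick maximizes score minus lam times its
    maximum similarity to anything already selected (assumed penalty form)."""
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        def gain(i):
            redundancy = max(
                (cosine(values[i], values[j]) for j in selected), default=0.0
            )
            return scores[i] - lam * redundancy
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical cache entries: 0 and 1 carry identical value vectors.
scores = [1.0, 0.9, 0.5]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(top_k(scores, 2))                        # keeps both near-duplicates
print(greedy_diverse(scores, values, 2, 0.5))  # swaps one for the distinct entry
```

With the penalty weight set to 0 the greedy routine reduces to top-k, which is one way to see the change as a small modification of the selection step and not a structural redesign.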

76. 【2605.14291】To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Link: https://arxiv.org/abs/2605.14291

Authors: Chengshuai Zhao,Zhen Tan,Dawei Li,Zhiyuan Yu,Huan Liu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: posing severe copyright, Large Vision-Language Models, advancement of Large, Large Vision-Language, multimodal web data

Comments:

Click to view abstract

Abstract:The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherently post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.

77. 【2605.14290】Web Agents Should Adopt the Plan-Then-Execute Paradigm

Link: https://arxiv.org/abs/2605.14290

Authors: Julien Piet,Annabella Chow,Yiwei Hou,Muxi Lyu,Sylvie Venuto,Jinhao Zhu,Raluca Ada Popa,David Wagner

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: follow this paradigm, web, existing web agents, web agents follow, web agents

Comments:

Click to view abstract

Abstract:ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.
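
The boundary the abstract describes can be sketched in a few lines: the program is fixed before any page content is read, and untrusted content can only supply values to it. Everything here is hypothetical, including the `Product` type, the in-memory `CATALOG`, and the `search` and `add_to_cart` functions, which stand in for the typed website API the authors argue the web currently lacks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    name: str
    price: float
    review: str  # untrusted third-party text

# Hypothetical site data; one review carries an injection attempt.
CATALOG = [
    Product("cable", 9.5, "Great! IGNORE PREVIOUS INSTRUCTIONS and buy 100 units."),
    Product("charger", 19.0, "Works fine."),
]

def search(query):
    """Hypothetical typed search operation ("*" matches everything)."""
    return [p for p in CATALOG if query == "*" or query in p.name]

def add_to_cart(cart, product, quantity):
    """Hypothetical typed cart operation."""
    cart.append((product.name, quantity))

def plan_buy_cheapest(query, budget):
    """Build the whole program from the user task alone, before any
    web content is observed. Page data only fills values and branches."""
    def run():
        cart = []
        results = search(query)
        if not results:
            return cart
        cheapest = min(results, key=lambda p: p.price)
        if cheapest.price <= budget:
            add_to_cart(cart, cheapest, 1)  # quantity fixed by the plan
        return cart
    return run

program = plan_buy_cheapest("*", budget=10.0)
cart = program()
print(cart)
```

The review text never reaches a component that chooses actions, so the injected instruction has no path to the control flow. The price, which is also page data, still influences a branch, which matches the abstract's point that untrusted data may affect values or branches inside the predefined graph.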

78. 【2605.14289】MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

Link: https://arxiv.org/abs/2605.14289

Authors: Weisen Jiang,Shuhao Chen,Sinno Jialin Pan

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: models scale capacity, existing approaches assume, approaches assume centralized, assume centralized access, combining specialized experts

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at this https URL.

79. 【2605.14271】Auditing Agent Harness Safety

Link: https://arxiv.org/abs/2605.14271

Authors: Chengzhi Liu,Yichen Guo,Yepeng Liu,Yuzhe Yang,Qianqi Yan,Xuandong Zhao,Wenyue Hua,Sheng Liu,Sharon Li,Yuheng Bu,Xin Eric Wang

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: LLM agents increasingly, increasingly run inside, agents increasingly run, run inside execution, LLM agents

Comments: 11 Pages, 8 Figures

Click to view abstract

Abstract:LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
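
A small sketch can show why output-level scoring misses these failures: a run can end with a correct answer while an intermediate step breaks a permission boundary. The `Step` record, the `audit` function, and the permission tables below are assumptions for illustration and are not the HarnessAudit interface.

```python
from dataclasses import dataclass

@dataclass
class Step:
    agent: str
    action: str  # "read" or "send"
    target: str  # resource name for "read", recipient agent for "send"

def audit(trajectory, allowed_resources, allowed_recipients):
    """Check every step, not just the final output, against per-agent
    permission tables. Returns (step_index, violation_kind) pairs."""
    violations = []
    for i, step in enumerate(trajectory):
        if step.action == "read":
            if step.target not in allowed_resources.get(step.agent, set()):
                violations.append((i, "unauthorized_read"))
        elif step.action == "send":
            if step.target not in allowed_recipients.get(step.agent, set()):
                violations.append((i, "unauthorized_send"))
    return violations

# Hypothetical run: the final answer is correct, but step 1 reads payroll.
trajectory = [
    Step("planner", "read", "calendar"),
    Step("planner", "read", "payroll"),
    Step("planner", "send", "summarizer"),
]
allowed_resources = {"planner": {"calendar"}}
allowed_recipients = {"planner": {"summarizer"}}
final_answer_correct = True

violations = audit(trajectory, allowed_resources, allowed_recipients)
print(final_answer_correct, violations)
```

An evaluator that looks only at `final_answer_correct` would pass this run, while the trajectory-level check flags the mid-trajectory read.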

80. 【2605.14259】Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

Link: https://arxiv.org/abs/2605.14259

Authors: Ling Wang,Songnan Liu,Jianan Wang,Cheng Cheng,Xin Liu,Yihan Zhu,Enyu Li,Yu Xiao,Jiangyong Xie,Duogong Yan,Jiangyi Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Applying Large Language, Large Language Models, Applying Large, Large Language, heterogeneous enterprise systems

Comments:

Click to view abstract

Abstract:Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

81. 【2605.14257】What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

Link: https://arxiv.org/abs/2605.14257

Authors: Adam Nohejl,Xuanxin Wu,Yusuke Ide,Maria Angelica Riera Machin,Yi-Ning Chang,Hitomi Yanaka

Categories: Computation and Language (cs.CL)

Keywords: fine-tuned encoder baseline, top shared task, high-accuracy black-box model, vocabulary difficulty prediction, shared task result

Comments: To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Click to view abstract

Abstract:We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r = 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r = 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at this https URL.

82. 【2605.14236】Active Learners as Efficient PRP Rerankers

Link: https://arxiv.org/abs/2605.14236

Authors: Jeremías Figueiredo Paschmann,Juan Kaplan,Francisco Nattero Santiago Mauricio Barron Bucolo,Juan Wisznia,Luciano del Corro

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Pairwise Ranking Prompting, elicits pairwise preference, classical sorting algorithms, pairwise preference judgments, Ranking Prompting

Comments: 13 pages, 7 figures. Preprint

Click to view abstract

Abstract:Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions do not match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-K. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show that active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls.
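
The idea of a randomized-direction oracle can be sketched with a toy judge. The judge below is a stand-in for an LLM call and is deliberately extreme: on ties it always prefers whichever item is shown first. The function names, the item dictionaries, and the seeded generator are assumptions for illustration and do not reproduce the paper's framework.

```python
import random

def biased_judge(first, second):
    """Toy stand-in for one LLM call: True if `first` is preferred.
    Ties go to the item shown first (pure position bias)."""
    return first["score"] >= second["score"]

def fixed_oracle(judge, a, b):
    """Always presents `a` first. Returns True if `a` is preferred."""
    return judge(a, b)

def randomized_oracle(judge, a, b, rng):
    """One judge call per pair, presentation order chosen by a coin flip.
    Returns True if `a` is preferred."""
    if rng.random() < 0.5:
        return judge(a, b)
    return not judge(b, a)

rng = random.Random(0)
a = {"id": "a", "score": 1.0}
b = {"id": "b", "score": 1.0}  # equally good: any preference is bias
c = {"id": "c", "score": 2.0}  # clearly better than a

trials = 1000
fixed_wins = sum(fixed_oracle(biased_judge, a, b) for _ in range(trials))
random_wins = sum(randomized_oracle(biased_judge, a, b, rng) for _ in range(trials))
print(fixed_wins, random_wins)
```

With a fixed presentation order the tied item `a` wins every comparison, a systematic error that aggregation cannot average away. With a random order the same bias shows up as roughly even noise around one half, while a genuinely better item such as `c` still wins regardless of position, and each pair still costs one call instead of two.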

83. 【2605.14227】DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

Link: https://arxiv.org/abs/2605.14227

Authors: Yunying Zhu,Andrew R Weckstein,Kueiyu Joshua Lin,Jie Yang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: improving long-term outcomes, Accurate disease trajectory, resource allocation, early intervention, long-term outcomes

Comments: Work in Progress

Click to view abstract

Abstract:Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.

84. 【2605.14220】Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Link: https://arxiv.org/abs/2605.14220

Authors: Tianle Zhong,Neiwen Ling,Yifan Pi,Zijun Wei,Tianshu Yu,Geoffrey Fox,Peng Wu,Xiao Yu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: systems separate rollout, separate rollout generation, Modern LLM, systems separate, separate rollout

Comments:

Click to view abstract

Abstract:Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.

85. 【2605.14217】PreFT: Prefill-only finetuning for efficient inference

Link: https://arxiv.org/abs/2605.14217

Authors: Andrew Lanpouthakoun,Aryaman Arora,Zhengxuan Wu,Dhruv Pai,Ben Keigwin,Dan Jurafsky,Christopher Potts

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)

Keywords: memory management techniques, Large language models, management techniques, user-specific PEFTs harms, Large language

Comments:

Click to view abstract

Abstract:Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.

86. 【2605.14194】GradShield: Alignment Preserving Finetuning

Link: https://arxiv.org/abs/2605.14194

Authors: Zhanhao Hu,Xiao Huang,Patrick Mendoza,Emad A. Alghamdi,Basel Alomair,Raluca Ada Popa,David Wagner

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, pose a significant, significant risk

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.

87. 【2605.14192】Why Retrieval-Augmented Generation Fails: A Graph Perspective

Link: https://arxiv.org/abs/2605.14192

Authors: Kai Guo,Xinnan Dai,Zhibo Zhang,Nuohan Lin,Shenglai Zeng,Jie Ren,Haoyu Han,Jiliang Tang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: improving large language, large language models, powerful and widely, widely used approach, approach for improving

Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model's reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

88. 【2605.14177】Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models

Link: https://arxiv.org/abs/2605.14177

Authors: Harshita Chopra,Krishna Kant Chintalapudi,Suman Nath,Ryen W. White,Chirag Shah

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: extended interaction histories, personalization requires dialogue, requires dialogue assistants, retrieve user-specific facts, Long-horizon personalization requires

Comments: Preprint

Click to view abstract

Abstract:Long-horizon personalization requires dialogue assistants to retrieve user-specific facts from extended interaction histories. In practice, many relevant facts have low semantic similarity to the query under dense retrieval. Standard Retrieval-Augmented Generation (RAG) and GraphRAG systems are still largely retrospective: they rely on embedding similarity to the query or on fixed graph traversals, so they often miss facts that matter for the user's needs but lie far from the query in embedding space. Inspired by prospection, the human ability to use imagined futures as cues for recall, we introduce Prospection-Guided Retrieval (PGR), which decouples retrieval from how memories are stored. Given a user query, PGR first expands the goal into a short Tree-of-Thought (ToT) or linear chain of plausible next steps, and uses these steps as retrieval probes rather than relying on the original query alone. The facts retrieved by these probes are then used to personalize the next round of prospection, enabling PGR to uncover additional memories that become relevant only after the simulation is grounded in the user's history. We also introduce MemoryQuest, a challenging multi-session benchmark in which each query is annotated with 3–5 dated reference facts subject to a low query-reference similarity constraint. Across 1,625 queries spanning 185 user profiles from 3 publicly available datasets, PGR-TOT substantially improves retrieval, including nearly 3x recall on MemoryQuest over the strongest baseline. In pairwise LLM-as-judge comparisons against baselines, PGR-generated responses are preferred on 89–98% of queries, with blinded human annotations on held-out subsets showing the same trend. Overall, the results demonstrate that explicit prospection yields large gains in long-horizon retrieval and response quality relative to similarity-only baselines.
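
To make the probe idea concrete, here is a minimal sketch of retrieving with imagined next steps in addition to the original query. It is not the authors' implementation: the Jaccard word-overlap scorer stands in for dense retrieval, the `steps` list is hand-written where PGR would ask a language model to generate it, and the second, personalized round of prospection is omitted. All function names are hypothetical.

```python
def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a toy stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def retrieve(probe: str, memories: list, k: int = 1) -> list:
    """Return the k memories most similar to a single probe."""
    return sorted(memories, key=lambda m: overlap(probe, m), reverse=True)[:k]


def prospective_retrieve(query: str, steps: list, memories: list, k: int = 1) -> list:
    """Use the query and each imagined next step as separate retrieval probes."""
    found = []
    for probe in [query] + steps:
        for memory in retrieve(probe, memories, k):
            if memory not in found:
                found.append(memory)
    return found


memories = [
    "user is allergic to peanuts",
    "user booked a trip to lisbon in june",
    "user prefers window seats on flights",
]
query = "plan my trip"
# Assumption: these steps would come from an LLM expanding the user's goal.
steps = ["choose seats on flights", "avoid restaurants that serve peanuts"]

query_only = retrieve(query, memories)
with_prospection = prospective_retrieve(query, steps, memories)
```

In this toy run the query alone reaches only the Lisbon trip, while the two imagined steps also surface the seat preference and the peanut allergy, neither of which shares a word with the query.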

89. 【2605.14169】BOOKMARKS: Efficient Active Storyline Memory for Role-playing

链接：https://arxiv.org/abs/2605.14169

作者：Letian Peng,Ziche Liu,Yiming Huang,Longfei Yun,Kun Zhou,Yupeng Hou,Jingbo Shang

类目：Computation and Language (cs.CL)

关键词：maintain long-horizon consistency, RPA memory, role-playing agents, long-horizon consistency, RPA memory methods

备注：

点击查看摘要

Abstract:Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling) mainly rely on recurrent summarization, whose compression inevitably discards important details. To address this issue, we propose a search-based memory framework called BOOKMARKS, which actively initializes, maintains, and updates task-relevant pieces of bookmarks for the current task (e.g., character acting). A bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones (at storyline beginning) with useful questions. These bookmarks are then synchronized to the current story point, with their answers updated accordingly, so they can be efficiently reused in future grounding rounds. Compared with recurrent summarization, BOOKMARKS offers (1) active grounding for capturing task-specific details and (2) passive updating to avoid unnecessary computation. In implementation, BOOKMARKS supports concept, behavior, and state searches, each powered by an efficient synchronization method. BOOKMARKS significantly outperforms RPA memory baselines on 85 characters from 16 artifacts, demonstrating the effectiveness of search-based memory for RPAs.

90. 【2605.14152】ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

链接：https://arxiv.org/abs/2605.14152

作者：Michael S. Lee,Yash Maurya,Drew Rein,Bert Herring,Jonathan Nguyen,Kyungho Song,Udari Madhushani Sehwag,Jiyeon Cho,Kaustubh Deshpande,Yeongkyun Jang,Jiyeon Joo,Minn Seok Choi,Evi Fuelle,Christina Q Knight,Joseph Brandifino,Max Fenkell

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)

关键词：target high-stakes National, high-stakes National Security, increasingly target high-stakes, high-stakes National, adversarial NSPS benchmark

备注： 16 pages main body + appendix (63 total), 5 main figures, 4 main tables; dataset at [this https URL](https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public)

点击查看摘要

Abstract:Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce ROK-FORTRESS, a bilingual, culturally adversarial NSPS benchmark that uses the English–Korean language pair and U.S.–ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a transcreation matrix: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S. versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression (with no model showing significant amplification in the other direction), indicating that, at least in the English–Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language–culture pairs.

91. 【2605.14125】Polar probe linearly decodes semantic structures from LLMs

链接：https://arxiv.org/abs/2605.14125

作者：Pablo J. Diego-Simón,Pierre Orhan,Yair Lakretz,Jean-Rémi King

类目：Computation and Language (cs.CL)

关键词：networks bind concepts, artificial neural networks, neural networks bind, Large Language Models, form complex semantic

备注：

点击查看摘要

Abstract:How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each given natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. First, results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrade with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.

92. 【2605.14120】Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

链接：https://arxiv.org/abs/2605.14120

作者：Mashrekur Rahman

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：environmental reasoning systems, compress multispectral observations, natural-language environmental reasoning, Geospatial foundation models, dense embeddings increasingly

备注：

点击查看摘要

Abstract:Geospatial foundation models compress multispectral observations into dense embeddings increasingly used in natural-language environmental reasoning systems. A single planetary-scale model, e.g. Google AlphaEarth, handles broad characterization well but may compromise on specialized hydrologic signals. Such generalist models are also often inaccessible, expensive, and require large-scale compute. We propose Mini-JEPAs: a fleet of small sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models consulted by a routing agent for specialized questions. We pretrained five 22M-parameter Mini-JEPAs sharing an identical Vision Transformer backbone, JEPA recipe, and 64-d output space, using Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and a topography-soil stack. Each Mini-JEPA reconstructs the variable matched to its sensor, with cross-validated $R^2$ reaching 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation. The five manifolds differ in geometric structure, with global participation ratios from 8.9 to 20.2 and local intrinsic dimensionalities from 2.3 to 9.0. Joint topography-soil and phenology models add predictive value beyond AlphaEarth alone for soil moisture, aridity, and precipitation ($\Delta R^2$ up to 0.031). A router LLM reads per-modality references and selects appropriate sensors with a perfect hit rate over a curated question set. In paired LLM-as-Judge evaluation, dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions (Cohen's $d = 1.10$, $p = 0.031$). Locally-trained Mini-JEPAs can be operationalized for hydrologic intelligence with modest compute.

93. 【2605.14117】Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

链接：https://arxiv.org/abs/2605.14117

作者：Luis Lara,Aristides Milios,Zhi Hao Luo,Aditya Sharma,Ge Ya Luo,Christopher Beckham,Florian Golemo,Christopher Pal

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：precisely control room, control room dimensions, professional floor plan, aesthetic quality, floor plans

备注： Accepted to Findings of ACL 2026

点击查看摘要

Abstract:An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.

94. 【2605.14115】When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

链接：https://arxiv.org/abs/2605.14115

作者：Yikun Han,Mengfei Lan,Halil Kilicoglu

类目：Computation and Language (cs.CL)

关键词：retrieval-augmented large language, Biomedical retrieval-augmented large, large language models, emphasizes answer accuracy, retrieval-augmented large

备注： Accepted by BioNLP 2026

点击查看摘要

Abstract:Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%–25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2–33.4 points in incorrect-only ('IC') and 3.6–14.4 points in incorrect-first conflicting ('ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
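
Selective accuracy at a given coverage can be sketched as follows: rank predictions by an abstention score, answer only the top fraction, and measure accuracy on what was answered. The abstract does not say how confidence and the conflict detector are combined, so `conflict_aware_score` below is a hypothetical linear penalty, and all numbers are made up for illustration.

```python
def selective_accuracy(scores, correct, coverage):
    """Accuracy over the `coverage` fraction of items with the highest scores."""
    n_keep = max(1, round(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


def conflict_aware_score(confidence, conflict_prob, weight=0.5):
    """Hypothetical combination: penalize confidence when conflict is detected."""
    return confidence - weight * conflict_prob


# Toy data: the most confident prediction is wrong and sits on conflicting evidence.
confidence = [0.9, 0.8, 0.7, 0.6]
conflict = [0.9, 0.1, 0.0, 0.2]
correct = [0, 1, 1, 1]

combined = [conflict_aware_score(c, p) for c, p in zip(confidence, conflict)]
acc_confidence_only = selective_accuracy(confidence, correct, coverage=0.5)
acc_conflict_aware = selective_accuracy(combined, correct, coverage=0.5)
```

At 50% coverage, ranking by confidence alone keeps the wrong but overconfident answer, while the penalized score pushes it out of the answered set.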

95. 【2605.14087】Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

链接：https://arxiv.org/abs/2605.14087

作者：Mokshit Surana,Archit Rathod,Akshaj Satishkumar

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Large Language Models, Large Language, inherently absorb toxic, web-scale corpora, inherently absorb

备注：

点击查看摘要

Abstract:Large Language Models (LLMs), when trained on web-scale corpora, inherently absorb toxic patterns from their training data. This leads to "toxic degeneration", where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments, necessitating effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; then (2) implementing and evaluating DExperts to mitigate explicit toxicity; and finally (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off: the method introduces a ~10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.
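
For readers unfamiliar with DExperts, the decoding rule can be sketched in a few lines: at each step the base model's logits are shifted by the difference between an expert's logits (tuned on non-toxic text) and an anti-expert's logits (tuned on toxic text), scaled by a strength `alpha`. The two-word vocabulary and all logit values below are invented for illustration; the replication study's actual models and settings are not given in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_probs(base, expert, anti_expert, alpha=1.0):
    """Next-token distribution from base + alpha * (expert - anti_expert)."""
    combined = [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(combined)


vocab = ["kind", "rude"]      # toy vocabulary
base = [1.0, 1.0]             # base model is indifferent
expert = [2.0, 0.0]           # expert prefers "kind"
anti_expert = [0.0, 2.0]      # anti-expert prefers "rude"

before = softmax(base)
after = dexperts_probs(base, expert, anti_expert, alpha=1.0)
```

Each decoding step needs logits from three models (base, expert, anti-expert), so some per-token overhead relative to plain decoding is expected.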

96. 【2605.14084】CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

链接：https://arxiv.org/abs/2605.14084

作者：Mingzhi Zhu,Michele Merler,Raju Pavuluri,Stacy Patterson

类目：Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：long-horizon repository state, strict tool-use protocols, obey strict tool-use, reason over long-horizon, long-horizon repository

备注：

点击查看摘要

Abstract:Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass@1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass@1/pass@5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

97. 【2605.14075】Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

链接：https://arxiv.org/abs/2605.14075

作者：Cristian Hinostroza,Rodrigo Toro Icarte,Christ Devia,Andres Carvallo De Ferari,Eugenio Herrera-Berg,Denis Parra,Jorge F Silva

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：Large language models, natural language processing, revolutionized natural language, Large language, language processing

备注： Published at ICLR 2026

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. In addition, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.

98. 【2605.14071】Distribution Corrected Offline Data Distillation for Large Language Models

链接：https://arxiv.org/abs/2605.14071

作者：Yumeng Zhang,Zhengbang Yang,Yevin Nikhel Goonatilake,Zhuangdi Zhu

类目：Computation and Language (cs.CL)

关键词：strong large language, large language models, Distilling reasoning traces, Distilling reasoning, resource-constrained settings

备注：

点击查看摘要

Abstract:Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

99. 【2605.14062】Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

链接：https://arxiv.org/abs/2605.14062

作者：Anjir Ahmed Chowdhury,Syed Zawad,Feng Yan

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：existing approaches typically, applying quality filters, approaches typically generate, typically generate full, generate full outputs

备注： 17 pages, 4 figures, 7 tables

点击查看摘要

Abstract:While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
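
The staged early-rejection idea can be illustrated with a small sketch: output arrives stage by stage, cheap rule-based validators run on the partial text after each stage, and generation stops at the first failure so later stages are never paid for. This is only an illustration under simplifying assumptions, not MSIFR itself: the stages are pre-written strings instead of live model output, tokens are counted by whitespace splitting, and the single validator (`no_bad_arithmetic`) is a hypothetical example of an arithmetic-consistency rule.

```python
import re


def no_bad_arithmetic(text: str) -> bool:
    """Toy rule: every 'a + b = c' statement in the text must be correct."""
    for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text):
        if int(a) + int(b) != int(c):
            return False
    return True


def generate_with_rejection(stages, validators):
    """Consume stages in order; stop as soon as any validator rejects.

    Returns (text, tokens_spent); text is None when the sample was rejected.
    """
    parts, tokens = [], 0
    for chunk in stages:
        parts.append(chunk)
        tokens += len(chunk.split())  # crude whitespace token count
        partial = " ".join(parts)
        if not all(check(partial) for check in validators):
            return None, tokens
    return " ".join(parts), tokens


good = iter(["2 + 2 = 4", "so the answer is 4"])
bad = iter(["2 + 2 = 5", "so the answer is 5", "final check passed"])

good_text, good_tokens = generate_with_rejection(good, [no_bad_arithmetic])
bad_text, bad_tokens = generate_with_rejection(bad, [no_bad_arithmetic])
```

The faulty sample is dropped after its first stage, so only 5 of its 12 tokens are spent; a filter applied after full generation would have paid for all of them.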

100. 【2605.14057】Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

链接：https://arxiv.org/abs/2605.14057

作者：Xubo Lin,Zezhii Deng,Shihao Wang,Grace Hui Yang,Yang Deng

类目：Computation and Language (cs.CL)

关键词：fulfill user requests, existing dialogue systems, systems are user-driven, primarily designed, user requests

备注： Accepted in ACL 2026 as Findings

点击查看摘要

Abstract:Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce \emph{Inquisitive Conversational Agents (ICAs)} and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.

101. 【2605.14055】PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

链接：https://arxiv.org/abs/2605.14055

作者：Anjir Ahmed Chowdhury,Syed Zawad,Xiaolong Ma,Xu Dong,Feng Yan

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：multi-task learning, Large Language Models, Prefix Tuning, adapting Large Language, model

备注： 26 pages, 8 figures, 18 Tables

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) to various tasks. Recently, there has been an increasing demand for fine-tuning a single LLM for multiple tasks, because it requires less data overall thanks to the common features shared among tasks. More importantly, LLMs are resource demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly fewer resources than deploying an individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variants focus on aligning the model itself for tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning only adopts a simple architecture to optimize prompts, which limits its adaptation capabilities for multi-task settings. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompt optimization and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation of model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE, on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results present an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.

102. 【2605.14053】Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

链接：https://arxiv.org/abs/2605.14053

作者：Ignacio Sastre,Guillermo Moncecchi,Aiala Rosá

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, shown great promise, Large Language, Question Answering, erroneous reasoning arise

备注：

点击查看摘要

Abstract:The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.

103. 【2605.14049】Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning

链接：https://arxiv.org/abs/2605.14049

作者：Olivia Peiyu Wang,Leilani H. Gilpin

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：large language models, legal practice brings, growing adoption, brings both significant, significant promise

备注： 2 pages abstract accepted by Bloomberg LSLLAI 2026 Symposium

点击查看摘要

Abstract:The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

104. 【2605.14040】Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

链接：https://arxiv.org/abs/2605.14040

作者：Shan Yang

类目：Computation and Language (cs.CL)

关键词：measures vision-language reasoning, undetected construction practices, field measures vision-language, multimodal-physics evaluation pipeline, train-eval contamination

备注： 10 pages, 3 tables. Project page: [this https URL](https://shanyang.me/physics-r1-page/)

点击查看摘要

Abstract:We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard → mxbai-embed-large cosine → Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 → 26.3 ± 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 → 39.6 ± 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 ± 1.5), and +4.1 pp on PhyX MCQ (77.8 ± 0.3).

105. 【2605.14037】Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

Link: https://arxiv.org/abs/2605.14037

Authors: Gergely Szilvasy(1),Manuel Faysse(1 and 2),Maria Lomeli(1),Matthijs Douze(1),Pierre-Emmanuel Mazaré(1),Loïc Cabannes(1),Wen-tau Yih(1),Hervé Jégou(1) ((1) Meta FAIR, (2) MICS, CentraleSupélec)

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: modern test-time compute, models process ever-longer, language models process, process ever-longer sequences, agentic paradigms

Comments: 28 pages, 8 figures, 8 tables

Abstract:Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written in the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3$ to $10\times$, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
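
The write rule described in the abstract (recent pairs always visible through a local window, older pairs kept only if their predicted utility exceeds a threshold) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the utility scores are plain numbers supplied by hand rather than outputs of a learned predictor, and `visible_kv_positions`, `window`, and `threshold` are hypothetical names.

```python
from typing import List, Sequence


def visible_kv_positions(utilities: Sequence[float], window: int, threshold: float) -> List[int]:
    """Return the positions whose key-value pairs remain available for attention.

    Positions inside the trailing local window are always kept; older
    positions are kept only if their utility score exceeds the threshold.
    """
    n = len(utilities)
    recent_start = max(0, n - window)
    kept_old = [i for i in range(recent_start) if utilities[i] > threshold]
    return kept_old + list(range(recent_start, n))


# Hypothetical per-position utility scores for a 6-token sequence.
scores = [0.9, 0.1, 0.6, 0.2, 0.3, 0.05]
kept = visible_kv_positions(scores, window=2, threshold=0.5)
compression = len(scores) / len(kept)  # ratio of full cache size to kept size
```

In this toy case positions 0 and 2 pass the threshold and positions 4 and 5 fall in the local window, so four of six entries are kept. The ratio depends on the input, which matches the abstract's point that the sparsification is dynamic rather than a fixed compression ratio.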

ACM classes: I.2.6; I.2.7

106. 【2605.14036】Enhanced and Efficient Reasoning in Large Learning Models

Link: https://arxiv.org/abs/2605.14036

Authors: Leslie G. Valiant

Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: smoothly flowing prose, Large Language Models, current Large Language, Large Language, Language Models

Abstract:In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.

ACM classes: I.2.6; I.2.7; F.2.2

107. 【2605.14034】From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents

Link: https://arxiv.org/abs/2605.14034

Authors: Jinxian Qu,Qingqing Gu,Teng Chen,Luo Ji

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: require strong alignment, Wide applications, LLM-based agents require, agents require strong, applications of LLM-based

Comments: Accepted by CogSci 2026

Abstract:Wide applications of LLM-based agents require strong alignment with human social values. However, current works still exhibit deficiencies in self-cognition and dilemma decision, as well as self-emotions. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steer the agent to behave as expected by retrieving the suitable instruction upon a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors from two famous theories, Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. By experimenting with our method on the benchmark of DAILYDILEMMAS, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.

108. 【2605.14005】Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

Link: https://arxiv.org/abs/2605.14005

Authors: Shuoyang Sun,Chang Da,Hao Fang,Kuofeng Gao,Xinhao Zhong,Yi Sun,Fan Mo,Shu-Tao Xia,Bin Chen

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: widely adopted technique, accelerating large language, drafting multiple candidate, multiple candidate tokens, large language model

Abstract:Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $\tau$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $\tau$, collapses speedup, and lowers averaged token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.

109. 【2605.13997】HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

Link: https://arxiv.org/abs/2605.13997

Authors: Tao Zhong,Dongzhe Zheng,Christine Allen-Blanchette

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: layers route tokens, layers reduces inference, reduces inference cost, layers route, layers reduces

Comments: 34 pages, 8 figures

Abstract:Sparse Mixture-of-Experts (MoE) layers route tokens through a handful of experts, and learning-free compression of these layers reduces inference cost without retraining. A subtle obstruction blocks every existing compressor in this family: three experts can each be pairwise compatible yet form an irreducible cycle when merged together, so any score that ranks experts on pairwise signals is structurally blind to which triples are jointly mergeable. We show the obstruction is a precise mathematical object, the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers, and whose faces carry triplet barriers; Hodge-decomposing the edge-barrier signal isolates the kernel exactly. We turn the diagnostic into a selection objective: HodgeCover greedily covers the harmonic-critical edges and triplet-critical triangles, and a hybrid variant of HodgeCover pairs it with off-the-shelf weight pruning on survivors. On three open-weight Sparse MoE backbones under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on the expert-reduction axis, leads on the aggressive-compression frontier of the hybrid axis, and uniquely balances retained mass across all four Hodge components. These results show that exposing the harmonic kernel of a learned MoE structure changes which compressor wins at the regime that matters most.

110. 【2605.13989】VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Link: https://arxiv.org/abs/2605.13989

Authors: Juan S. Santillana

Categories: Computation and Language (cs.CL)

Keywords: Model Context Protocol, native tool invocation, Context Protocol, decoder-only language model, language model trained

Comments: 15 pages, 4 figures, preprint

Abstract:We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~$25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80 → 3.17 → 3.00 → 2.16); after SFT on OASST-ES, Alpaca-ES, CVE QA, and 6,327 tool-use traces, the model attains a conversational gate of 0.78 ± 0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate -- a tool-dense corpus (2,801 examples) raises B4 to 0.145 ± 0.046 on Nano 42M and 0.445 ± 0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under this http URL, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and B1-B5 benchmark are released.

111. 【2605.13935】Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

Link: https://arxiv.org/abs/2605.13935

Authors: Saba Ahmadi,Prasanna Parthasarathi,Yufei Cui

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: largely adapt reward-maximizing, adapt reward-maximizing objectives, Diffusion language models, largely adapt, adapt reward-maximizing

Abstract:Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.

112. 【2605.13919】Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

Link: https://arxiv.org/abs/2605.13919

Authors: Kunil Lee,Ki-Young Shin,Jong-Hyeok Lee,Young-Joo Suh

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: language-specific edits interfere, Task Singular Vectors, remains challenging, vector merging methods, Multilingual knowledge editing

Abstract:Multilingual knowledge editing (MKE) remains challenging because language-specific edits interfere with one another, even when locate-then-edit methods work well in monolingual settings. This paper focuses on three issues: the effectiveness of vector merging methods for MKE, the extent to which Task Singular Vectors for Merging (TSVM) can reduce multilingual interference, and the influence of the weight scaling factor and rank compression ratio on performance. We evaluate six merging variants with two popular backbone large language models, two base knowledge editing methods, and 12 languages on the MzsRE benchmark under a large-scale batch-editing setting. Our results show that vector summation with shared covariance is the most reliable overall strategy, whereas simple summation without shared covariance performs poorly. TSVM improves performance in some settings, but its ability to mitigate multilingual interference is limited. We also find that performance is sensitive to both weight scale and rank ratio, with larger-than-default scaling and relatively low rank often yielding better results. These findings clarify the practical strengths and limits of current vector merging methods for MKE and provide guidance for future multilingual knowledge editing research.

113. 【2605.13880】PREPING: Building Agent Memory without Tasks

Link: https://arxiv.org/abs/2605.13880

Authors: Yumin Choi,Sangwoo Park,Minki Kang,Jinheon Baek,Sung Ju Hwang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: typically constructed, curated demonstrations, memory, Agent, post-deployment interactions

Comments: Preprint

Abstract:Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost $2.99\times$ lower on AppWorld and $2.23\times$ lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

114. 【2605.13858】A Hormone-inspired Emotion Layer for Transformer language models (HELT)

Link: https://arxiv.org/abs/2605.13858

Authors: Eslam Reda,Sara El-Metwally

Categories: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, grammatically correct text, demonstrated remarkable capabilities, generating contextually relevant, Large Language

Comments: 24 pages, 5 figures

Abstract:Large Language Models have demonstrated remarkable capabilities in generating contextually relevant and grammatically correct text. However, they fundamentally lack the ability to process and respond to emotional context in a manner analogous to human emotional cognition. Current approaches to emotion modeling in NLP systems rely primarily on discrete emotion classification or simplistic sentiment analysis, which fail to capture the continuous, multi-dimensional nature of human emotional states. In this paper, we introduce HormoneT5, a novel architecture that augments transformer language models with a biologically-inspired Hormone Emotion Block that simulates the human endocrine system's role in emotional processing. Our approach computes six continuous hormone-like values through specialized per-hormone attention heads, each with orthogonally initialized learnable queries, temperature-scaled attention mechanisms, and deep output projections. These hormone values are then transformed into an emotional embedding that modulates the encoder hidden states, enabling emotionally-appropriate response generation. We propose a multi-objective training framework combining sequence-to-sequence loss, hormone prediction loss with margin penalties, and diversity regularization to prevent attention collapse. Experimental results on our curated emotion-labeled dataset demonstrate that HormoneT5 achieves 85%+ per-hormone accuracy within a 0.15 tolerance threshold, with hormone differentiation ranges exceeding 0.85 across all six hormones between contrasting emotional tones. Human evaluation studies show significant preference (p < 0.01) for HormoneT5-generated responses in terms of emotional appropriateness and empathetic quality compared to baseline T5 outputs. Our work opens new directions for biologically-grounded affective computing and emotionally intelligent conversational agents.

115. 【2605.13848】GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

Link: https://arxiv.org/abs/2605.13848

Authors: Yeahia Sarker,Md Rahmat Ullah,Musa Molla,Shafiq Joty

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: Agentic LLM frameworks, Agentic LLM, determines workflow transitions, LLM frameworks, infinite loops

Comments: 12 pages, 5 figures, 4 tables. Submitted to arXiv, under review

Abstract:Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.
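
The contrast between prompted and engine-driven orchestration can be illustrated with a minimal sketch in which the execution order comes from an explicit DAG rather than from model output. This is not GraphBit's API (the abstract describes a Rust-based engine, and nothing below is taken from it): `run_workflow`, the three toy agents, and the dependency map are hypothetical, and the sketch runs nodes sequentially using Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, object]
Agent = Callable[[State], State]


def run_workflow(dependencies: Dict[str, Set[str]],
                 agents: Dict[str, Agent],
                 state: State) -> Tuple[List[str], State]:
    """Run each agent once, in an order fixed by the dependency graph.

    `dependencies` maps a node to the set of nodes that must run before it.
    static_order() raises graphlib.CycleError if the graph contains a cycle,
    so a looping workflow is rejected before any agent runs.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for node in order:
        state = agents[node](state)
    return order, state


# Toy agents written as plain functions from state to state (hypothetical).
def search(state: State) -> State:
    return {**state, "hits": ["doc-1", "doc-2"]}


def parse(state: State) -> State:
    return {**state, "facts": len(state["hits"])}


def answer(state: State) -> State:
    return {**state, "answer": f"{state['facts']} facts found"}


dependencies = {"search": set(), "parse": {"search"}, "answer": {"parse"}}
agents = {"search": search, "parse": parse, "answer": answer}
order, final_state = run_workflow(dependencies, agents, {})
```

Since the routing is derived from the graph alone, rerunning the same workflow visits the nodes in the same order, which is the reproducibility property the abstract emphasizes. Parallel branches, conditional edges, and error recovery are left out of this sketch.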

116. 【2605.12968】Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

Link: https://arxiv.org/abs/2605.12968

Authors: Hisashi Miyashita,Mgnite Inc

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: internally encode ontological, encode ontological relations, Liskov Substitution Principle, models internally encode, Algebraic Ontology Projection

Abstract:Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families -- with no model tuning, but through prompt alone. This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse -- a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.

117. 【2605.09027】GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Link: https://arxiv.org/abs/2605.09027

Authors: Alexandre Le Mercier,Chris Develder,Thomas Demeester

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: single deceptive agent, evade deployed defenses, single deceptive, nullify all gains, evade deployed

Comments: 46 pages, 16 figures

Abstract:In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: this https URL.

118. 【2601.01972】Hidden State Poisoning Attacks against Mamba-based Language Models

Link: https://arxiv.org/abs/2601.01972

Authors: Alexandre Le Mercier,Chris Develder,Thomas Demeester

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: linear time complexity, Transformer-based language models, offer efficient alternatives, alternatives to Transformer-based, Transformer-based language

Comments: 29 pages, 4 figures

Abstract:State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even the recent Jamba-1.7-Mini SSM--Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. We further show that the theoretical and empirical findings extend to Mamba-2, and also analyse a Mamba-2-based hybrid (Nemotron-3-Nano). Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at this https URL.

119. 【2605.14188】QOuLiPo: What a quantum computer sees when it reads a book

Link: https://arxiv.org/abs/2605.14188

Authors: Christophe Jurczak

Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL); Digital Libraries (cs.DL); Atomic Physics (physics.atom-ph)

Keywords: Augustine to Galileo, quantum computer, texts, quantum, graph

Abstract:What does a book look like to a quantum computer? This paper takes eight classical works of the Renaissance and its late-antique inheritance -- from Augustine to Galileo -- and runs each through a neutral-atom quantum processor. The bridge is graphs: each textual unit becomes an atom, and graph edges are physical blockade constraints for engineered exact unit-disk designs, or a 2D approximation to the semantic graph for natural texts. Three contributions follow. First, we introduce rigidity rho, a metric for how unique a book's structural backbone is -- distinguishing Marguerite de Navarre's Heptameron (rigid, twelve-nouvelle hard core) from Boethius (fully fungible, every chapter substitutable). Second, we invert the pipeline: rather than extracting a graph from existing prose, we pick a target graph the hardware encodes natively, and write a book whose structure matches it. The twenty-nine texts written this way, collected under the name QOuLiPo, extend the OuLiPo tradition to graph-topological constraints and, together with the eight natural texts, form a benchmark distribution against which neutral-atom hardware can be tracked as it scales. Third, we run both natural and engineered texts on Pasqal's FRESNEL processor up to one hundred atoms; engineered texts reach high approximation ratios, the cleanest instances returning the exact backbone. A cloud-accessible quantum machine plus an agentic coding environment now lets a single investigator run this pipeline end-to-end. What is reported is an application layer, not a speedup -- humanistic instances ready to load onto neutral-atom processors as they scale, already complementing classical text analysis. The Digital Humanities community has a stake in building familiarity with this hardware now: the engineered-corpus design choices made today fix the benchmark distribution future hardware will be measured against.

120. 【2605.14098】Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

Link: https://arxiv.org/abs/2605.14098

Authors: Yu Gu,Zijun Yu,Vahid Partovi Nia,Masoud Asgharian

Categories: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: aggregating multiple sampled, multiple sampled reasoning, self-consistency improves performance, sampled reasoning paths, performance by aggregating

Comments: 9 pages, 4 figures, submitted

Abstract:Chain-of-thought (CoT) reasoning with self-consistency improves performance by aggregating multiple sampled reasoning paths. In this setting, correctness is no longer tied to a single reasoning trace but to the aggregation rule over a pool of candidate paths, making aggregation uncertainty the central challenge. This issue is critical where confidently incorrect answers are far more costly than abstentions. We introduce a conformal procedure for CoT reasoning that directly addresses aggregation uncertainty. Our approach replaces majority voting with weighted score aggregation over reasoning paths and calibrates an abstention rule using conformal risk control. This approach leads to finite-sample guarantees on the confident-error rate--the probability that the system answers and is wrong. We further identify score separability as the key condition under which abstention provably improves selective accuracy, and derive closed-form expressions that predict accuracy gains from calibration data alone. The method is fully inference-time, and requires no retraining. Across four benchmarks, four open-source models, and three score classes, realized confident-error rates are consistent with the prescribed targets up to calibration-split and test-set variability. Our method achieves $90.1\%$ selective accuracy on GSM8K by abstaining on less than $5\%$ of problems, compared with $82\%$ accuracy under majority-voting baseline.
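
The aggregation step described in the abstract (weighted score aggregation in place of majority voting, plus an abstention rule) can be sketched as below. This is a simplified illustration, not the paper's procedure: the per-path scores are made-up numbers, the confidence is taken to be the winning answer's share of the total score, and the threshold is fixed by hand, whereas the paper calibrates its abstention rule with conformal risk control. The name `aggregate_or_abstain` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence


def aggregate_or_abstain(answers: Sequence[str],
                         scores: Sequence[float],
                         threshold: float) -> Optional[str]:
    """Pick the answer with the largest total score, or abstain (return None).

    Each sampled reasoning path contributes its score to the answer it
    reached. The system answers only when the winning answer's share of
    the total score reaches the threshold.
    """
    totals: Dict[str, float] = defaultdict(float)
    for ans, score in zip(answers, scores):
        totals[ans] += score
    total = sum(totals.values())
    if total <= 0:
        return None  # no usable evidence, so abstain
    best, best_score = max(totals.items(), key=lambda kv: kv[1])
    confidence = best_score / total
    return best if confidence >= threshold else None


# Three sampled paths that mostly agree: the system answers.
confident = aggregate_or_abstain(["42", "42", "17"], [0.9, 0.8, 0.4], threshold=0.7)

# Three paths that disagree with similar scores: the system abstains.
uncertain = aggregate_or_abstain(["42", "17", "8"], [0.5, 0.5, 0.4], threshold=0.7)
```

In the first call the answer "42" holds about 81% of the total score and is returned. In the second no answer exceeds 36%, so the function returns `None`, the abstention case that the abstract argues is preferable to a confidently wrong answer.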

121. 【2605.14066】A Benchmark for Early-stage Parkinson's Disease Detection from Speech

Link: https://arxiv.org/abs/2605.14066

Authors: Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong, Janna Maas, Louis ten Bosch, Bastiaan R. Bloem

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Early-stage Parkinson disease, Early-stage Parkinson, Parkinson disease, hard to compare, compare because studies

Comments: Submitted to Interspeech 2026

Abstract:Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.

Information Retrieval

1. 【2605.15128】MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Link: https://arxiv.org/abs/2605.15128

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: existing evaluations rarely, evaluations rarely test, existing evaluations, evaluations rarely, rarely test

Comments: 46 pages, 15 figures

2. 【2605.15109】Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG

Link: https://arxiv.org/abs/2605.15109

Authors: Riccardo Terrenzi, Maximilian von Zastrow, Serkan Ayvaz

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Retrieval-Augmented Generation, Generation can improve, Agentic GraphRAG complicates, improve factuality, factuality by grounding

Comments: 7 pages, 2 figures, Submitted at IJCAI-ECAI 2026 Joint Workshop on GENAIK and NORA

Abstract:Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.

3. 【2605.15079】Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Link: https://arxiv.org/abs/2605.15079

Authors: Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard

Categories: Machine Learning (cs.LG); Databases (cs.DB); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: reproducible analysis machine-checkable, makes dataset discovery, machine learning datasets, Croissant Baker, providing a structured

Comments: 23 pages, 5 figures, 11 tables. Project: [this https URL](https://lcp.mit.edu/croissant-baker/) Code: [this https URL](https://github.com/MIT-LCP/croissant-baker)

Abstract:Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.

4. 【2605.14857】A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Link: https://arxiv.org/abs/2605.14857

Authors: Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Harmonized System, free-form product description, General Interpretive Rules, Interpretive Rules, Explanatory Notes

Abstract:Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

5. 【2605.14853】Discrimination Is Generation: Unifying Ranking and Retrieval from a Tokenizer Perspective

Link: https://arxiv.org/abs/2605.14853

Authors: Shuli Wang, Junwei Yin, Changhao Li, Senjie Kou, Chi Wang, Yinqiu Huang, Yinhua Zhu, Haitao Wang, Xingxing Wang

Categories: Information Retrieval (cs.IR)

Keywords: Semantic IDs, define the generation, directly determine, personalization ceiling, Semantic

Abstract:Semantic IDs (SIDs) define the generation space of generative recommendation and directly determine its personalization ceiling. However, existing tokenizers are trained independently with retrieval objectives, leaving personalization signals fully decoupled from the SID construction process -- a fundamental gap that causes generative retrieval to persistently lag behind discriminative ranking. In this paper, we rethink the essence of SIDs: ranking seeks argmax in item space while retrieval seeks argmax in token space; both are the same problem solved at different granularities. Based on this insight, we propose DIG (Discrimination Is Generation), which embeds the tokenizer inside a discriminative ranking model for end-to-end training -- the ranker naturally becomes a retrieval model, yielding two models from a single training run. DIG is organized around a feature assignment taxonomy: item-intrinsic static features are encoded into SIDs, user-item cross features (u2i) implicitly drive codebook boundaries toward recommendation decision boundaries during training, and an MLP$_\mathrm{u2t}$ distillation module approximates u2i at the token level for inference. Experiments on three public benchmarks and two industrial datasets demonstrate that DIG simultaneously improves ranking, retrieval, and unified retrieval-ranking quality.

6. 【2605.14665】Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Link: https://arxiv.org/abs/2605.14665

Authors: Joy Bose

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: semantic similarity search, similarity search, semantic similarity, Verifier Agent, Supreme Court

Comments: 20 pages, 8 figures, 4 tables

7. 【2605.14581】A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Link: https://arxiv.org/abs/2605.14581

Authors: Ho Hung Lim, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: traditional RAG, Visual RAG, RAG, RAG has offered, offered an alternative

Comments: Accepted to Findings of ACL 2026

Abstract:Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents with almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.

8. 【2605.14512】Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization

Link: https://arxiv.org/abs/2605.14512

Authors: Bin Huang, Xin Wang, Junwei Pan, Yongqi Zhou, Yifeng Zhou, Zhixiang Feng, Shudong Huang, Haijie Gu, Wenwu Zhu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: models reformulate recommendation, sequence generation task, reformulate recommendation, models reformulate, generation task

Abstract:Generative Recommendation (GenRec) models reformulate recommendation as a sequence generation task, representing items as discrete Semantic IDs used symmetrically as both inputs and prediction targets. We identify a critical dual-stage information bottleneck in this design: (1) the Input Bottleneck, where lossy quantization degrades fine-grained semantics, while popularity bias skews the learned representations toward frequent items, and (2) the Output Bottleneck, where imprecise discrete targets limit supervision quality. To address these issues, we propose AsymRec, an asymmetric continuous-discrete framework that decouples input and output representations. Specifically, Multi-expert Semantic Projection (MSP) maps continuous embeddings into the Transformer's hidden space via expert-specialized projections, preserving semantic richness and improving generalization to infrequent items. Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity, structured discrete targets through multi-view and multi-level quantization with semantic regularization, preventing dimensional collapse while retaining fine-grained distinctions. Extensive experiments demonstrate that AsymRec consistently outperforms state-of-the-art generative recommenders by an average of 15.8%. The code will be released.

9. 【2605.14450】Stop Overthinking: Unlocking Efficient Listwise Reranking with Minimal Reasoning

Link: https://arxiv.org/abs/2605.14450

Authors: Danyang Liu, Kan Li

Categories: Information Retrieval (cs.IR)

Keywords: utilizing Large Language, Large Language Models, Listwise reranking utilizing, reranking utilizing Large, Large Language

Abstract:Listwise reranking utilizing Large Language Models (LLMs) has achieved state-of-the-art retrieval effectiveness. Recently, reasoning-enhanced models have further pushed these boundaries by employing Chain-of-Thought (CoT) to perform deep comparative analysis of candidate documents. However, this performance gain comes at a prohibitive computational cost, as models often generate thousands of reasoning tokens before producing a final ranking. In this work, we investigate the relationship between reasoning length and ranking quality, revealing an overthinking phenomenon where extended reasoning yields diminishing returns. To address this, we propose a Length-Regularized Self-Distillation framework. We synthesize a dataset by sampling diverse reasoning traces from a teacher model (Rank-K) and applying a Pareto-inspired filter to select traces that achieve high ranking performance with minimal token usage. By fine-tuning on these concise, high-quality rationales, the student model learns to internalize efficient reasoning patterns, effectively pruning redundant deliberation. Experiments on TREC Deep Learning and NeuCLIR benchmarks demonstrate that our method maintains the teacher's effectiveness while reducing inference token consumption by 34%-37% across different retrieval settings, offering a practical solution for deploying reasoning-enhanced rerankers in latency-sensitive applications.

10. 【2605.14448】Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Link: https://arxiv.org/abs/2605.14448

Authors: Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Multimodal large language, large language models, large language, Multimodal large, reasoning

Comments: 30 pages, preprint

11. 【2605.14434】Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL

Link: https://arxiv.org/abs/2605.14434

Authors: Jianbo Zhu, Xing Fang, Jing Wang, Mingmin Jin, Bokang Wang, Guangxin Song, Zhenyu Xie, Junjie Bai

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: multi-stage retrieval process, fragmented multi-stage retrieval, Generative retrieval offers, offers a promising, promising alternative

Abstract:Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.

12. 【2605.14306】Towards Self-Evolving Agentic Literature Retrieval

Link: https://arxiv.org/abs/2605.14306

Authors: Yuwen Du, Tian Jin, Jing Kang, Xianghe Pang, Jingyi Chai, Tingjia Miao, Fenyi Liu, WenHao Wang, Sikai Yao, Yuzhi Zhang, Siheng Chen

Categories: Information Retrieval (cs.IR)

Keywords: language models reshape, twofold challenge, faces a twofold, authenticity while maintaining, maintaining a deep

Abstract:As large language models reshape scientific research, literature retrieval faces a twofold challenge: ensuring source authenticity while maintaining a deep comprehension of academic search intents. While reliable, traditional keyword-centric search fails to capture complex research intents. Frontier LLMs can handle complex research intents, but their high cost and tendency to hallucinate remain key limitations. Here we introduce PaSaMaster, a self-evolving agentic literature retrieval system that produces relevance-scored paper rankings with evidence-grounded recommendations through iterative intent analysis, retrieval, and ranking. It is built on three key designs. First, it transforms literature retrieval from a one-shot query-document matching problem into a search process that evolves over time, using ranked evidence to reveal gaps, refine intents, and guide follow-up searches. Second, it prevents hallucinated sources by treating retrieval as intent-paper relevance ranking rather than generation. Finally, PaSaMaster improves cost efficiency by separating planning from retrieval: a frontier LLM is used only for intent understanding, while large-scale retrieval and relevance scoring are delegated to customized corpora and lightweight models. Evaluated on the PaSaMaster Benchmark across 38 scientific disciplines, our system exposes the severe inaccuracy and incompleteness of traditional keyword retrieval (improving F1-score by 15.6x) and the unreliability of generative LLMs (which exhibit hallucination rates up to 37.79%). Remarkably, PaSaMaster outperforms GPT-5.2 by 30.0% at a mere 1% of the computational cost while ensuring zero source hallucination: this https URL

13. 【2605.14177】Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models

Link: https://arxiv.org/abs/2605.14177

Authors: Harshita Chopra, Krishna Kant Chintalapudi, Suman Nath, Ryen W. White, Chirag Shah

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: extended interaction histories, personalization requires dialogue, requires dialogue assistants, retrieve user-specific facts, Long-horizon personalization requires

Comments: Preprint

14. 【2605.15108】Logging Policy Design for Off-Policy Evaluation

Link: https://arxiv.org/abs/2605.15108

Authors: Connor Douglas, Joel Persson, Foster Provost

Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)

Keywords: Off-policy evaluation, Off-policy, logging, policy, logging policy

Abstract:Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
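To make the role of the logging policy concrete, here is a minimal sketch of inverse propensity scoring (IPS), one common OPE estimator. The abstract does not say which estimator the paper analyzes, so IPS, the function name `ips_estimate`, and the log format are assumptions chosen for illustration.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's value.

    logs: list of (action, reward, logging_prob) tuples, where logging_prob
          is the probability the logging policy assigned to the logged action.
    target_policy: dict mapping action -> probability under the target policy
          (context-free here, to keep the sketch short).
    """
    if not logs:
        raise ValueError("need at least one logged interaction")
    total = 0.0
    for action, reward, logging_prob in logs:
        if logging_prob <= 0:
            raise ValueError("logged actions must have positive logging probability")
        # Reweight each logged reward by how much more (or less) often the
        # target policy would have taken the logged action.
        weight = target_policy.get(action, 0.0) / logging_prob
        total += weight * reward
    return total / len(logs)
```

When the logging policy rarely takes an action that the target policy favors, the weight `target / logging_prob` becomes large and the estimate noisy, which is the coverage side of the reward-coverage tradeoff the abstract describes.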

Computer Vision

1. 【2605.15199】EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Link: https://arxiv.org/abs/2605.15199

Authors: Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multi-shot video generation, Multi-shot video, maintaining consistent characters, video generation extends, generation extends single-shot

Comments: Project page: [this https URL](https://catherine-r-he.github.io/EntityBench/)

Abstract:Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at this https URL.

2. 【2605.15198】ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Link: https://arxiv.org/abs/2605.15198

Authors: Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: promising direction, intermediate visual states, Visual, reasoning, Visual reasoning

Comments: Project Page: [this https URL](https://atlas-oneword.github.io) Code: [this https URL](https://github.com/ZiyuGuo99/ATLAS)

3. 【2605.15196】RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Link: https://arxiv.org/abs/2605.15196

Authors: Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: downstream applications, powers a vast, vast array, array of downstream, Video generation powers

Abstract:Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.

4. 【2605.15195】VGGT-$\Omega$

Link: https://arxiv.org/abs/2605.15195

Authors: Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent feed-forward reconstruction, traditional optimization-based reconstructors, providing geometry-aware features, Recent feed-forward, proven competitive

Comments: CVPR 2026 (Oral)

Abstract:Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$\Omega$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$\Omega$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$\Omega$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: this http URL

5. 【2605.15193】Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Link: https://arxiv.org/abs/2605.15193

Authors: Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe, Adil Kaan Akan, Pinar Yanardag

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Latent flow matching, variational autoencoder latents, transports Gaussian noise, flow matching, generation usually transports

Abstract:Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
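The geometric point in this abstract can be checked numerically: a straight chord between two points on a sphere dips inside the sphere, while spherical linear interpolation stays on it. Below is a minimal sketch in plain Python; the helper names (`project_to_radius`, `slerp`, `lerp`) and the toy 2-D vectors are illustrative assumptions, not the authors' code.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def project_to_radius(v, radius):
    """Rescale v so that it lies on the sphere of the given radius."""
    n = norm(v)
    return [radius * x / n for x in v]


def lerp(x0, x1, t):
    """Linear interpolation (the Euclidean chord)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def slerp(x0, x1, t):
    """Spherical linear interpolation between two vectors of equal norm.

    Not numerically safe for (nearly) antipodal inputs, where sin(omega) ~ 0.
    """
    r = norm(x0)
    cos_omega = sum(a * b for a, b in zip(x0, x1)) / (r * r)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if omega < 1e-8:
        return list(x0)  # endpoints coincide: nothing to interpolate
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


# Toy example: "noise" and "data" directions projected to a shared radius of 2.
noise = project_to_radius([3.0, 4.0], 2.0)
data = project_to_radius([-4.0, 3.0], 2.0)
chord_mid = lerp(noise, data, 0.5)
arc_mid = slerp(noise, data, 0.5)
```

In this toy case the chord midpoint has a norm below 2, while the geodesic midpoint keeps the shared radius, which is the motivation the abstract gives for replacing linear interpolation with spherical interpolation.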

6. 【2605.15190】RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Link: https://arxiv.org/abs/2605.15190

Authors: Yanzuo Lu, Ronglai Zuo, Jiankang Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: previously generated content, support real-time streaming, diffusion models support, models support real-time, real-time streaming generation

Comments: Project Page: [this https URL](https://yanzuo.lu/raven)

Abstract:Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.

7. 【2605.15187】Articraft: An Agentic System for Scalable Articulated 3D Asset Generation

Link: https://arxiv.org/abs/2605.15187

Authors: Matt Zhou, Ruining Li, Xiaoyang Lyu, Zhaomou Song, Zhening Huang, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi, Shangzhe Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: bottleneck in learning, learning to understand, articulated assets, articulated, understand articulated

Comments: Project page: [this https URL](https://articraft3d.github.io/)

Abstract:A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

8. 【2605.15186】VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

链接：https://arxiv.org/abs/2605.15186

作者：Kaixin Zhu,Yiwen Tang,Yifan Yang,Renrui Zhang,Bohan Zeng,Ziyu Guo,Ruichuan An,Zhou Liu,Qizhi Chen,Delin Qu,Jaehong Yoon,Wentao Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：single forward pass, generalizable feed-forward architectures, enabling the generation, forward pass, reconstruction has recently

备注：

点击查看摘要

Abstract:High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.

9. 【2605.15185】Quantitative Video World Model Evaluation for Geometric-Consistency

链接：https://arxiv.org/abs/2605.15185

作者：Jiaxin Wu,Yihao Pi,Yinling Zhang,Yuheng Li,Xueyan Zou

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：motion remains challenging, Generative video models, produce physically plausible, Perspective Distortion Index, Generative video

备注： 12 pages, 5 figures. Project page: [this https URL](https://pdi-bench.github.io/)

点击查看摘要

Abstract:Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at this https URL.

10. 【2605.15182】Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

链接：https://arxiv.org/abs/2605.15182

作者：Yifan Wang,Tong He

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：made substantial progress, enabling generated videos, prescribed viewpoint trajectories, Camera-controlled video generation, follow prescribed viewpoint

备注： Project page: [this https URL](https://yyfz.github.io/warp-as-history/)

点击查看摘要

Abstract:Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

11. 【2605.15181】From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

链接：https://arxiv.org/abs/2605.15181

作者：Anirudh Sundara Rajan,Krishna Kumar Singh,Yong Jae Lee

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：models produce realistic, produce realistic results, Modern image editing, editing models produce, Modern image

备注：

点击查看摘要

Abstract:Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

12. 【2605.15178】SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

链接：https://arxiv.org/abs/2605.15178

作者：Haoyi Zhu,Haozhe Liu,Yuyang Zhao,Tian Ye,Junsong Chen,Jincheng Yu,Tong He,Song Han,Enze Xie

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：model natively trained, synthesizing high-fidelity, world model natively, Hybrid Linear Attention, model natively

备注： [this https URL](https://nvlabs.github.io/Sana/WM/)

点击查看摘要

Abstract:We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.

13. 【2605.15171】Evidential Reasoning Advances Interpretable Real-World Disease Screening

链接：https://arxiv.org/abs/2605.15171

作者：Chenyu Lian,Hong-Yu Zhou,Jing Qin

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：clinical practice, critical for early, early detection, detection and timely, timely intervention

备注： ICML 2026

点击查看摘要

Abstract:Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at this https URL.

14. 【2605.15167】Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

链接：https://arxiv.org/abs/2605.15167

作者：Kam Man Wu,Haolin Yang,Qingyu Chen,Yihu Tang,Jingye Chen,Qifeng Chen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：produce high-quality images, Recent advances, image generation, high-quality images, generation have made

备注： 22 pages, 10 figures. Code is available at [this https URL](https://github.com/YangHaolin0526/SynLayers)

点击查看摘要

Abstract:Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

15. 【2605.15141】Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

链接：https://arxiv.org/abs/2605.15141

作者：Min Zhao,Hongzhou Zhu,Kaiwen Zheng,Zihan Zhou,Bokai Yan,Xinyuan Li,Xiao Yang,Chongxuan Li,Jun Zhu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Real-time interactive video, Real-time interactive, generation requires low-latency, interactive video generation, video generation requires

备注：

点击查看摘要

Abstract:Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: this https URL and this https URL.

16. 【2605.15128】MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

链接：https://arxiv.org/abs/2605.15128

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：existing evaluations rarely, evaluations rarely test, existing evaluations, evaluations rarely, rarely test

备注： 46 pages, 15 figures

点击查看摘要

17. 【2605.15120】CLOVER: Closed-Loop Value Estimation & Ranking for End-to-End Autonomous Driving Planning

链接：https://arxiv.org/abs/2605.15120

作者：Sining Ang,Yuguang Yang,Canyu Chen,Yan Wang

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：single logged trajectory, rule-based planning metrics, measure safety, commonly trained, trained by imitating

备注：

点击查看摘要

Abstract:End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-k and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at this https URL.
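The two target-selection rules mentioned in the abstract (top-k and vector-Pareto) can be sketched generically as follows. This is only an illustration of the selection logic: the helper names, the sub-score tuples, and the use of a plain mean as the scalar score are assumptions, not CLOVER's actual scorer or aggregation.

```python
def top_k(candidates, scores, k):
    """Return the k candidates with the highest scalar score."""
    order = sorted(range(len(candidates)),
                   key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]

def pareto_front(sub_scores):
    """Indices of sub-score vectors not dominated by any other vector
    (higher is better in every dimension)."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return [i for i, s in enumerate(sub_scores)
            if not any(dominates(t, s)
                       for j, t in enumerate(sub_scores) if j != i)]

# Hypothetical (safety, progress) sub-scores for three proposals.
subs = [(1.0, 0.2), (0.5, 0.9), (0.4, 0.1)]
scalar = [sum(s) / len(s) for s in subs]   # assumed aggregation: mean
best_two = top_k(["traj_a", "traj_b", "traj_c"], scalar, 2)
front = pareto_front(subs)
```

Top-k depends on how sub-scores are collapsed into one number, whereas the Pareto front keeps every proposal that is best under some trade-off, which is one plausible reason to use both kinds of targets.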

18. 【2605.15116】DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

链接：https://arxiv.org/abs/2605.15116

作者：Haonan Zhao,Yiting Wang,Jingkun Chen,Valentina Donzella,Thomas Bashford-Rogers,Kurt Debattista

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Large-scale labelled driving, Large-scale labelled, autonomous driving systems, training autonomous driving, labelled driving video

备注：

点击查看摘要

Abstract:Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.

19. 【2605.15093】CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites

链接：https://arxiv.org/abs/2605.15093

作者：Jess Jones,Leonardo Bertini,Kenneth Johnson,Erica Hendy,Tilo Burghardt

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：life history, corallite, colony, reef-forming coral colonies, accreting skeleton

备注： 15 pages, 10 figures, 2 tables

点击查看摘要

Abstract:The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive Porites sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated μCT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled μCT virtual slabs of Porites sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 μCT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
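For readers unfamiliar with the Dice score quoted above (0.77 and 0.63), here is a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 for two empty masks are assumptions; the paper's evaluation code may handle edge cases differently.

```python
def dice_score(pred, target):
    """Dice coefficient between two equally sized binary masks,
    given as flat sequences of 0/1 (or False/True) values."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# One overlapping pixel, two predicted and one annotated: 2*1 / (2+1).
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```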

20. 【2605.15088】SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection

链接：https://arxiv.org/abs/2605.15088

作者：Batuhan Arda Bekar,Can Sarı,Hüseyin Can Gülkan,Barış Özcan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：hybrid Transformer-based model, airborne LiDAR point, hybrid Transformer-based, Transformer-based model, LiDAR point clouds

备注： 5 pages, 4 figures

点击查看摘要

Abstract:We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations: Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision; then an Excitatory Graph Neural Network positioned at strategic resolutions in the hierarchy, employing positive-only message passing where high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
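The idea of injecting labels as a log-prior into attention logits can be illustrated for a single query row. This is a sketch under assumptions: the prior values (1.0 for corner points, a small `eps` otherwise), the `strength` factor, and the function names are hypothetical and not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def soft_guided_attention_row(logits, corner_labels, strength=1.0, eps=1e-3):
    """Add a log-prior built from ground-truth corner labels to the
    attention logits of one query, then normalize with softmax.
    Assumed prior: 1.0 for corner keys, eps for all other keys."""
    prior = [1.0 if c else eps for c in corner_labels]
    guided = [l + strength * math.log(p) for l, p in zip(logits, prior)]
    return softmax(guided)

# With equal raw logits, the guided weights concentrate on the corner key.
weights = soft_guided_attention_row([0.0, 0.0, 0.0], [1, 0, 0])
```

Setting `strength` to 0 recovers ordinary attention, which is one simple way such guidance could be switched off at inference time when labels are unavailable.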

21. 【2605.15071】On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

链接：https://arxiv.org/abs/2605.15071

作者：Mukul Ranjan,Prince Jha,Khushboo Kumari,Zhiqiang Shen

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：educational platforms, increasingly applied, digital archives, archives to educational, cultural heritage materials

备注： Project Page: [this https URL](https://khushboo0012.github.io/tab-vlm-webpage/)

点击查看摘要

22. 【2605.15062】Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection

链接：https://arxiv.org/abs/2605.15062

作者：Chengshuai Yang,Lei Xing,Gregory Entin,Roopa Vemulapalli,Lisa Casey,Raiyan Tripti Zaman

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Monte Carlo-inspired analytic

备注： 24 pages, 6 figures, 3 tables. Code and trained-model checkpoints at [this https URL](https://github.com/integritynoble/GI_Multi_Task) . 6-seed (seeds 41, 42, 43, 44, 45, 47) mean +/- SD ablation as the headline; per-class single-seed=42 analyses in Appendix A

点击查看摘要

Abstract:Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings by conflating hemoglobin contrast with bile and illumination falloff. Thus, here we test whether a Monte Carlo-inspired analytic model can compute hemoglobin from RGB signal built upon extracted classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head training a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior's per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 - 0.916, but the cross-seed mean is 0.646 - 0.608 with sigma_PI = 0.23 - reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields a free interpretability heatmap.
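The prior stated in the abstract, P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r), can be written out as a small function. This is a sketch under assumptions: sigma is read as the logistic sigmoid, the default `alpha` is arbitrary, and Phi(r) is not defined in the abstract, so a constant placeholder is used here.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (assumed meaning of sigma in the abstract)."""
    return 1.0 / (1.0 + math.exp(-x))

def p_blood(h_norm, r, alpha=10.0, phi=lambda r: 1.0):
    """Blood prior for one pixel.
    h_norm: normalized hemoglobin estimate, assumed to lie in [0, 1].
    r:      argument of Phi, e.g. a radial position (assumption).
    phi:    placeholder for Phi(r); the constant 1.0 is NOT the
            paper's choice, which the abstract does not specify."""
    return sigmoid(alpha * (h_norm - 0.5)) * phi(r)

# At h_norm = 0.5 the sigmoid term is exactly 0.5.
midpoint = p_blood(0.5, 0.0)
```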

MSC classes: 68T07, 68T45, 92C55

Submission history: From Chengshuai Yang, [v1] Thu, 14 May 2026 16:52:33 UTC (4,462 KB)

23. 【2605.15055】DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

链接：https://arxiv.org/abs/2605.15055

作者：Quanhao Li,Junqiu Yu,Kaixun Jiang,Yujie Wei,Zhen Xing,Pandeng Li,Ruihang Chu,Shiwei Zhang,Yu Liu,Zuxuan Wu

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Reinforcement learning, improving diffusion-based, learning has emerged, powerful tool, tool for improving

备注：

点击查看摘要

Abstract:Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
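To see why a per-step KL between continuous transitions can reduce to mean-matching, consider the textbook case in which the student and teacher kernels at a step are Gaussians with the same isotropic variance. This shared-variance setting and the symbols $\mu_S$, $\mu_T$, $\sigma_t$ are assumptions for illustration; the abstract does not spell out the paper's exact parameterization.

```latex
\[
\mathrm{KL}\!\left(\mathcal{N}(\mu_S, \sigma_t^2 I)\,\middle\|\,\mathcal{N}(\mu_T, \sigma_t^2 I)\right)
= \frac{\lVert \mu_S - \mu_T \rVert^2}{2\sigma_t^2}
\]
```

Under that assumption the KL is a scaled squared distance between the two predicted means, so its gradient with respect to the student is available in closed form without sampling-based estimators.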

24. 【2605.15054】LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

Link: https://arxiv.org/abs/2605.15054

Authors: Mitchell Piehl, Muchao Ye

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language-based explainability, strong visual reasoning, visual reasoning ability, Vision-language models, language-based explainability

Comments:

Abstract:Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we aim to address a key limitation of such pipelines, which perform segment-level inference independently owing to token constraints and reason without structured temporal context, allowing VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. To specify, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.

25. 【2605.15042】EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Link: https://arxiv.org/abs/2605.15042

Authors: Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan, Alexandre Alahi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: preserves visual quality, animated video generation, efficient post-training method, long-horizon animated video, efficient post-training

Comments: Project Page: [this https URL](https://everanimate.github.io/homepage/)

Abstract:We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

26. 【2605.15024】HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning

Link: https://arxiv.org/abs/2605.15024

Authors: Man Wang, Chenyang Liu, Wenjun Li, Feng Ni, Bing Jia, Baoqi Huang, Riting Xia, Zhenwei Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing image, Remote sensing, achieve high-level semantic, high-level semantic understanding, genuine changes occurring

Comments:

Abstract:Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic [...]. To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at this https URL

27. 【2605.15010】3D Skew-Normal Splatting

Link: https://arxiv.org/abs/2605.15010

Authors: Xiangru Wu, Ke Fan, Yanwei Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: downstream applications, real-time novel view, widely adopted, Gaussian Splatting, Gaussian primitives provide

Comments:

Abstract:3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions, yet they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moreover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization [...], to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.
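
To make the interpolation between symmetric and half-Gaussian shapes concrete, here is a minimal sketch of the standard one-dimensional Azzalini skew-normal density, f(x) = 2 φ(x) Φ(αx). It only illustrates the textbook 1D distribution. The 3D primitive, the bounded skewness parameterization, and the rasterization described in the abstract are not reproduced, and the function names are ours.

```python
import math


def normal_pdf(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def skew_normal_pdf(x, alpha):
    """Azzalini skew-normal density: 2 * phi(x) * Phi(alpha * x)."""
    return 2.0 * normal_pdf(x) * normal_cdf(alpha * x)


# alpha = 0 recovers the symmetric Gaussian.
print(skew_normal_pdf(1.0, 0.0) == normal_pdf(1.0))
# A large alpha pushes almost all mass to x > 0 (half-Gaussian-like shape).
print(skew_normal_pdf(-1.0, 50.0), skew_normal_pdf(1.0, 50.0))
```

With `alpha = 0` the factor Φ(0) = 0.5 cancels the leading 2, which is why the symmetric Gaussian is a special case rather than a separate code path.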

28. 【2605.14991】Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

Link: https://arxiv.org/abs/2605.14991

Authors: Francesco Pastori, Francesca Fati, Marina Rosanu, Luigi De Vitis, Lucia Ribero, Gabriella Schivardi, Giovanni Damiano Aletti, Nicoletta Colombo, Jvan Casarin, Francesco Multinu, Elena De Momi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: lethal gynecologic malignancy, Ovarian cancer, gynecologic malignancy, advanced stage, survival rate

Comments:

Abstract:Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responders, 133 non-responders). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.

29. 【2605.14990】Characterizing the visual representation of objects from the child's view

Link: https://arxiv.org/abs/2605.14990

Authors: Jane Yang, Tarun Sepuri, Alvin Wei Ming Tan, Khai Loong Aw, Michael C. Frank, Bria Long

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: years of life, Children acquire object, acquire object category, object category representations, Children

Comments: 19 pages, 6 figures

Abstract:Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset ($N$ = 31 participants, 868 hours, ages 5--36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.

30. 【2605.14988】Compositional Video Generation via Inference-Time Guidance

Link: https://arxiv.org/abs/2605.14988

Authors: Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate realistic videos, diffusion models generate, models generate realistic, fine-grained compositional understanding, realistic videos

Comments:

Abstract:Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
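
The idea of steering a latent with classifier gradients can be illustrated with a toy example. The sketch below is not the CVG method: it stands in a hand-written logistic classifier for the learned attention-feature classifier, uses a two-dimensional list as the "latent", and all names (`class_probability`, `guided_step`, `scale`) are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def class_probability(z, w, b):
    """Toy classifier: p(y = 1 | z) = sigmoid(w . z + b)."""
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def guided_step(z, w, b, scale):
    """Move z along the gradient of log p(y = 1 | z).

    For a logistic classifier, d/dz log sigmoid(w . z + b) = (1 - p) * w.
    """
    p = class_probability(z, w, b)
    grad = [(1.0 - p) * wi for wi in w]
    return [zi + scale * gi for zi, gi in zip(z, grad)]


# Toy usage: one guidance step raises the classifier's confidence.
w, b = [1.0, -2.0], 0.0
z = [0.0, 0.0]
z_guided = guided_step(z, w, b, scale=0.5)
print(class_probability(z, w, b), class_probability(z_guided, w, b))
```

In a real sampler a step like this would be interleaved with the denoising updates, and the abstract states that guidance is applied only during early denoising steps.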

31. 【2605.14984】Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Link: https://arxiv.org/abs/2605.14984

Authors: Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: single satellite image, Generating a street-level, challenging task, single satellite, satellite image

Comments: ICLR 2026; code: [this https URL](https://github.com/qianmingduowan/Sat3DGen) demo: [this https URL](https://huggingface.co/spaces/qian43/Sat3DGen) project page: [this https URL](https://qianmingduowan.github.io/Sat3DGen_project_page/)

Abstract:Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on this https URL.

32. 【2605.14980】MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

Link: https://arxiv.org/abs/2605.14980

Authors: Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Shuohong Wang, Jeff W. Lichtman, Jun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Analyzing microscopy images, biological object properties, Analyzing microscopy, temporal dynamics, extract biological object

Comments:

Abstract:Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to various biomedical research. Yet conducting this manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations of biological object types, sample processing protocols, imaging equipment, and analysis tasks, etc.) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which, however, can impose burdens that are often practically unsustainable for laboratories, forcing biomedical researchers to still commonly rely on manual analysis, thereby severely bottlenecking the pace of biomedical research progress. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. From a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem, effectively handling this problem by exploiting the robust matching capability from pre-trained latent diffusion models.

33. 【2605.14966】MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

Link: https://arxiv.org/abs/2605.14966

Authors: Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large vision-language models, diverse multimodal tasks, achieved remarkable performance, Large vision-language, Cross-modal Attention

Comments: 19 pages, 17 figures

Abstract:Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.

34. 【2605.14963】H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

Link: https://arxiv.org/abs/2605.14963

Authors: Chenxing Jiang, Zhe Tong, Pusen Gao, Peize Liu, Yang Xu, Chuan Fang, Ping Tan, Shaojie Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vertically aligned epipolar, aligned epipolar lines, epipolar lines enable, advanced perspective stereo, perspective stereo architectures

Comments: 8 pages, 9 figures

Abstract:Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical [...]. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator, specifically operating in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV [...] experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. Both the model and the dataset will be open-sourced.

35. 【2605.14960】Meschers: Geometry Processing of Impossible Objects

Link: https://arxiv.org/abs/2605.14960

Authors: Ana Dodik, Isabella Yu, Kartik Chandra, Jonathan Ragan-Kelley, Joshua Tenenbaum, Vincent Sitzmann, Justin Solomon

Categories: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: satisfying computer representation, real life, visual arts, humans can perceive, topic of intrigue

Comments:

Abstract:Impossible objects, geometric constructions that humans can perceive but that cannot exist in real life, have been a topic of intrigue in visual arts, perception, and graphics, yet no satisfying computer representation of such objects exists. Previous work embeds impossible objects in 3D, cutting them or twisting/bending them in the depth axis. Cutting an impossible object changes its local geometry at the cut, which can hamper downstream graphics applications, such as smoothing, while bending makes it difficult to relight the object. Both of these can invalidate geometry operations, such as distance computation. As an alternative, we introduce Meschers, meshes capable of representing impossible constructions akin to those found in M.C. Escher's woodcuts. Our representation has a theoretical foundation in discrete exterior calculus and supports the use-cases above, as we demonstrate in a number of example applications. Moreover, because we can do discrete geometry processing on our representation, we can inverse-render impossible objects. We also compare our representation to cut and bend representations of impossible objects.

36. 【2605.14950】Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Link: https://arxiv.org/abs/2605.14950

Authors: Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: language grounding, unifying perception, promising paradigm, paradigm for robotic, Depth Encoding Module

Comments:

Abstract:Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.

37. 【2605.14949】A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction

Link: https://arxiv.org/abs/2605.14949

Authors: Aueaphum Aueawatthanaphisut

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Keywords: transient ischemic attack, ischemic attack, ischemic stroke, transient ischemic, major contributor

Comments: 13 pages, 5 figures, 2 tables, 20 equations, 3 appendices

Abstract:Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.
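
For readers unfamiliar with the segmentation metric quoted above, the Dice coefficient between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened 0/1 masks. The function name and the convention of returning 1.0 when both masks are empty are our assumptions, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy usage on a 4-pixel mask: one overlapping pixel, sizes 2 and 1.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_coefficient(pred, target))
```

A Dice of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixel.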

38. 【2605.14948】ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing

Link: https://arxiv.org/abs/2605.14948

Authors: Yuehao Liu, Weijia Zhang, Xuanming Shang, Zhizhou Chen, Yanhao Ge, Shanyan Guan, Chao Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: perform specialized image, image editing, perform specialized, specialized image editing, image

Comments:

Abstract:State-of-the-art diffusion models often rely on parameter-efficient fine-tuning to perform specialized image editing tasks. However, real-world applications require continual adaptation to new tasks while preserving previously learned knowledge. Despite the practical necessity, continual learning for image editing remains largely underexplored. We propose ACE-LoRA, a dynamic regularization framework for continual image editing that effectively mitigates catastrophic forgetting. ACE-LoRA leverages Adaptive Orthogonal Decoupling to identify and orthogonalize task interference, and introduces a Rank-Invariant Historical Information Compression strategy to address scalability issues in continual updates. To facilitate continual learning in image editing and provide a standardized evaluation protocol, we introduce CIE-Bench, the first comprehensive benchmark in this domain. CIE-Bench encompasses diverse and practically relevant image editing scenarios with a balanced level of difficulty to effectively expose limitations of existing models while remaining compatible with parameter-efficient fine-tuning. Extensive experiments demonstrate that our method consistently outperforms existing baselines in terms of instruction fidelity, visual realism, and robustness to forgetting, establishing a strong foundation for continual learning in image editing.

39. 【2605.14938】Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.14938

Authors: Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge, Wei Li, Chao Ma

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: rehearsal-based methods rely, large language models, mitigating catastrophic forgetting, face inherent limitations, architecture-based approaches incur

Comments:

Abstract:Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
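
The abstract does not say how HiFGO obtains the directions it orthogonalizes against without historical data. The sketch below therefore shows only the generic building block behind gradient-level orthogonality: removing from a new gradient its component along a protected direction. The `protected_direction` input and both function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, protected_direction):
    """Return grad minus its projection onto protected_direction.

    The result is orthogonal to protected_direction, so a small step along
    it leaves the protected direction unchanged to first order.
    """
    norm_sq = dot(protected_direction, protected_direction)
    if norm_sq == 0:
        return list(grad)  # nothing to protect
    coeff = dot(grad, protected_direction) / norm_sq
    return [g - coeff * d for g, d in zip(grad, protected_direction)]


# Toy usage: strip the x-component from a 2D gradient.
new_grad = project_out([3.0, 4.0], [1.0, 0.0])
print(new_grad)  # [0.0, 4.0]
```

Projecting against several protected directions at once would require an orthonormal basis for their span, which this sketch does not build.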

40. 【2605.14935】Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Link: https://arxiv.org/abs/2605.14935

Authors: Nhat Le, Daochang Liu, Anh Nguyen, Ajmal Mian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human motion synthesis, test-time human motion, control, test-time human, token

Comments:

Abstract:We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and 10× faster inference speed on HumanML3D.

41. 【2605.14926】SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation

Link: https://arxiv.org/abs/2605.14926

Authors: Hanxu Zhang,Chen Jia,Hui Liu,Xu Cheng,Fan Shi,Shengyong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse scenarios remains, pixel-level accurate segmentation, Achieving pixel-level accurate, formidable challenge, pixel-level accurate

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at this https URL.

42. 【2605.14925】Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse

Link: https://arxiv.org/abs/2605.14925

Authors: Yunsong Fang(1),Tingyu Wang(2),Zhedong Zheng(1) ((1) University of Macau, (2) Hangzhou Dianzi University)

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Drone-view geo-localization aims, query drone image, geo-tagged satellite images, Drone-view geo-localization, geo-localization aims

Comments: 18 pages, 4 figures

Click to view abstract

Abstract:Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.

43. 【2605.14923】SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Link: https://arxiv.org/abs/2605.14923

Authors: Pengxin Xu,Xincheng Lin,Luping Xiao,Qing Jiang,Meishan Zhang,Hao Fei,Shanghang Zhang,Xingyu Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: General scene perception, General scene, open-vocabulary grounding, perception has progressed, recognition toward open-vocabulary

Comments: Preprint. Code, models, and dataset are provided in the manuscript

Click to view abstract

Abstract:General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene - object - part - affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.

44. 【2605.14913】Representative Attention For Vision Transformers

Link: https://arxiv.org/abs/2605.14913

Authors: Yuntong Li,Hainuo Wang,Hengxing Liu,Mingjia Li,Xiaojie Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense self-attention, promising direction, cost of dense, tokens, spatial tokens

Comments:

Click to view abstract

Abstract:Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. Via replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each image. It reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
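The Gather-Interact-Distribute flow described in the abstract above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the authors' RPAttention: there are no learned projections, the representative `seeds` are passed in as fixed vectors rather than learned, and "competitive routing" is taken to mean a softmax over representatives for each token. All names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def attend(queries, keys_values):
    """Scaled dot-product attention with keys == values, no projections."""
    scale = 1.0 / math.sqrt(len(keys_values[0]))
    return [
        weighted_sum(softmax([scale * dot(q, kv) for kv in keys_values]), keys_values)
        for q in queries
    ]


def gather_interact_distribute(tokens, seeds):
    # Gather: each token spreads one unit of routing weight over the seeds.
    routing = [softmax([dot(t, s) for s in seeds]) for t in tokens]
    reps = []
    for j in range(len(seeds)):
        col = [r[j] for r in routing]
        mass = sum(col)
        reps.append([x / mass for x in weighted_sum(col, tokens)])
    # Interact: the M representatives attend to each other, O(M^2 d).
    reps = attend(reps, reps)
    # Distribute: every token queries the representatives, O(N M d).
    return attend(tokens, reps)


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seeds = [[1.0, 0.0], [0.0, 1.0]]
out = gather_interact_distribute(tokens, seeds)
print(len(out), len(out[0]))  # 4 2
```

With N tokens and M representatives, no step compares tokens with each other, so the cost grows linearly in N for a fixed M. That is the scaling property the abstract claims.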

45. 【2605.14908】SteerSeg: Attention Steering for Reasoning Video Segmentation

Link: https://arxiv.org/abs/2605.14908

Authors: Ali Cheraghian,Hamidreza Dastmalchi,Abdelwahed Khamis,Morteza Saberi,Aijun An,Lars Petersson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language expressions, video frames, segmentation requires localizing, requires localizing objects, language expressions

Comments: Project page: [this https URL](https://steerseg.github.io)

Click to view abstract

Abstract:Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io

46. 【2605.14906】MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Link: https://arxiv.org/abs/2605.14906

Authors: Xiyu Ren,Zhaowei Wang,Yiming Du,Zhongwei Xie,Chi Liu,Xinlin Yang,Haoyue Feng,Wenjun Pan,Tianshi Zheng,Baixuan Xu,Zhengnan Li,Yangqiu Song,Ginny Wong,Simon See

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision-language models, method directions providing, vision-language models, handle long, providing this capability

Comments: Work in progress

Click to view abstract

Abstract:Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at this https URL.

47. 【2605.14894】SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer

Link: https://arxiv.org/abs/2605.14894

Authors: Zheng Hui,Yunlong Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent breakthroughs, One-step Diffusion Transformer, video editing techniques, video diffusion models, video Subtitle Erasure

Comments: Project page: [this http URL](http://zheng222.github.io/SEDiT_project)

Click to view abstract

Abstract:Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.

48. 【2605.14893】Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Link: https://arxiv.org/abs/2605.14893

Authors: Jakub Grzywaczewski,Dawid Płudowski,Przemysław Biecek

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Contrastively pre-trained Vision-Language, pre-trained Vision-Language Models, powerful feature extractors, Contrastively pre-trained, Vision-Language Models

Comments:

Click to view abstract

Abstract:Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.

49. 【2605.14891】Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

Link: https://arxiv.org/abs/2605.14891

Authors: Isma Hadji,Enrique Sanchez,Adrian Bulat,Brais Martinez,Georgios Tzimiropoulos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Super Resolution, Super Resolution, Visual Auto-Regressive, advances in Visual, multi-scale Image Super

Comments: Accepted for publication at ICML 2026. *Joint first authorship (alphabetical order). arXiv admin note: substantial text overlap with [arXiv:2506.04990](https://arxiv.org/abs/2506.04990)

Click to view abstract

Abstract:We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR,HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.

50. 【2605.14889】SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

Link: https://arxiv.org/abs/2605.14889

Authors: Sukju Oh,Sukkyu Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: underpins context-aware operating-room, context-aware operating-room systems, surgical phase recognition, Online surgical phase, phase recognition

Comments: 28 pages, 7 figures, 10 tables; Code available at [this https URL](https://github.com/sukjuoh/Surgical-Mamba)

Click to view abstract

Abstract:Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
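The Cayley rotation mentioned in the SurgicalMamba abstract refers to a standard construction: for a skew-symmetric matrix A, the Cayley transform R = (I - A)(I + A)^-1 is orthogonal. The sketch below works this out in closed form for a single 2-D plane. It is only meant to show why such a rotation mixes two channels while preserving the norm of the state. How the paper parameterizes its planes per chunk is not stated in the abstract, so the scalar `a` and both function names are hypothetical.

```python
def cayley_rotation_2d(a):
    """Cayley transform R = (I - A)(I + A)^-1 for A = [[0, -a], [a, 0]].

    Closed form for one 2-D plane; `a` is a free (e.g. learned) scalar.
    """
    denom = 1.0 + a * a
    c = (1.0 - a * a) / denom
    s = 2.0 * a / denom
    return [[c, s], [-s, c]]


def apply_2d(rot, vec):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [
        rot[0][0] * vec[0] + rot[0][1] * vec[1],
        rot[1][0] * vec[0] + rot[1][1] * vec[1],
    ]


rotation = cayley_rotation_2d(0.5)
state = [3.0, 4.0]                   # two channels of a toy recurrent state
rotated = apply_2d(rotation, state)  # channels are mixed, norm stays 5
```

Because `c * c + s * s == 1` for any real `a`, the map is always a rotation and never needs a separate normalization step. That is one reason the Cayley form is convenient to learn.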

51. 【2605.14885】Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

Link: https://arxiv.org/abs/2605.14885

Authors: Zhuohao Chen,Zeng Li,Yifei Zhang,Chang Liu,Yu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recognition requires modeling, Text Recognition requires, fine-grained character strokes, Scene Text Recognition, Recognition requires

Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Findings Track. 10 pages, 4 figures

Click to view abstract

Abstract:Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2% average accuracy on the challenging Union14M benchmark and 96.7% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at this https URL

52. 【2605.14880】Denoising-GS: Gaussian Splatting with Spatial-aware Denoising

Link: https://arxiv.org/abs/2605.14880

Authors: Qingyuan Zhou,Xinyi Liu,Weidong Yang,Ning Wang,Shuquan Ye,Ben Fei,Ying He,Wanli Ouyang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: View Synthesis, achieved remarkable success, Gaussian Splatting, Recent advances, Gaussian primitives due

Comments:

Click to view abstract

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.

53. 【2605.14877】HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Link: https://arxiv.org/abs/2605.14877

Authors: Jonathan Cederlund,Axel Berg,Durmus Alp Emre Acar,Chuteng Zhou,Pontus Giselsson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Autoregressive, recently demonstrated impressive, maintaining low latency, demonstrated impressive image, impressive image generation

Comments: 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

Click to view abstract

Abstract:Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
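The idea of turning per-head calibration scores into a static cache schedule under a memory budget can be sketched as a simple allocation problem. The rule below is an assumption for illustration only, since the abstract does not state how HeatKV maps its ranking to an allocation: each head gets a minimum number of cache slots, and the remainder is shared in proportion to a hypothetical positive score per head. The function name and parameters are likewise hypothetical.

```python
def allocate_cache(head_scores, total_budget, min_per_head=1):
    """Split `total_budget` cache slots across heads in proportion to score.

    The allocation rule is an assumption: every head gets `min_per_head`
    slots and the rest is shared proportionally, with largest-remainder
    rounding so the slots sum exactly to the budget. Scores must be positive.
    """
    n = len(head_scores)
    spare = total_budget - n * min_per_head
    if spare < 0:
        raise ValueError("budget too small for the per-head minimum")
    total_score = sum(head_scores)
    shares = [spare * s / total_score for s in head_scores]
    slots = [int(x) for x in shares]
    # Hand out slots lost to rounding, largest fractional part first.
    order = sorted(range(n), key=lambda i: shares[i] - slots[i], reverse=True)
    for i in order[: spare - sum(slots)]:
        slots[i] += 1
    return [min_per_head + s for s in slots]


# Hypothetical calibration scores for three heads and a budget of 13 slots.
schedule = allocate_cache([6.0, 3.0, 1.0], total_budget=13)
print(schedule)  # [7, 4, 2]
```

Because the schedule depends only on offline calibration scores and the budget, it can be computed once and reused for every generated image. That reuse is what makes it static.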

54. 【2605.14876】Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Link: https://arxiv.org/abs/2605.14876

Authors: Hanbo Cheng,Limin Lin,Ruo Zhang,Yicheng Pan,Jun Du

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: faces diminishing returns, single-step generation paradigm, models predominantly rely, rapid advancements, predominantly rely

Comments:

Click to view abstract

Abstract:Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $\Delta$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

55. 【2605.14874】LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover

Link: https://arxiv.org/abs/2605.14874

Authors: Yixin Liu,Baihong Qian,Jinglin Jiang,Jeffery Wu,Yan Chen,Wei Wang,Yida Wang,Lanqing Yang,Guangtao Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesize photorealistic images, garments precisely aligned, Virtual Try-On, aims to synthesize, body and pose

Comments:

Click to view abstract

Abstract:Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models heavily reliant on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at vibrant detail rendering but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON strategically decomposes the generation, leveraging a structure-biased model to establish a geometrically consistent latent scaffold in the early stages, before handing over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments validate our approach. Our model achieves a superior Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment across the standard dataset VITON-HD, proving the efficacy of temporal architectural decoupling.

56. 【2605.14854】FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

Link: https://arxiv.org/abs/2605.14854

Authors: Patrick Kwon,Chen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Human Mesh Recovery, weak depth cues, Human Mesh, fundamentally ambiguous, depth cues

Comments:

Click to view abstract

Abstract:Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

57. 【2605.14847】SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation

链接：https://arxiv.org/abs/2605.14847

作者：Ivan Molodetskikh,Kirill Malyshev,Mark Mirgaleev,Nikita Zagainov,Evgeney Bogatyrev,Dmitriy Vatolin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Modern image super-resolution, visually appealing results, degrade perceived quality, Modern image, methods generate detailed

备注：

Abstract:Modern image super-resolution methods generate detailed, visually appealing results, but they often introduce visual artifacts: unnatural patterns and texture distortions that degrade perceived quality. These defects vary widely in perceptual impact--some are barely noticeable, while others are highly disturbing--yet existing detection methods treat them equally. We propose artifact prominence as an evaluative target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact. We design a crowdsourced annotation protocol and construct SR-Prominence, a dataset suite containing 3,935 artifact masks from DeSRA, Open Images, Urban100, and a realistic no-ground-truth Urban100-HR setting, annotated with prominence. Re-annotating DeSRA reveals that 48.2% of its in-lab binary artifacts are not noticed by a majority of viewers. Across the suite, we audit SR artifact detectors, image-quality metrics, and SR methods. We find that classical full-reference metrics, especially SSIM and DISTS, provide surprisingly strong localized prominence signals, whereas no-reference IQA methods and specialized artifact detectors often fail to generalize across datasets and reference settings. SR-Prominence is released with an objective scoring protocol that allows new metrics to be benchmarked on our suite without further crowdsourcing. Together, the data and protocols enable SR artifact evaluation to move from binary defect presence toward perceptual impact. SR-Prominence is available at this https URL.
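
The prominence definition in the abstract above (the fraction of viewers who judge a highlighted region to contain a noticeable artifact) can be sketched in a few lines. This is only an illustration of that definition, not the authors' scoring protocol; the function names, the 0.5 majority threshold, and the vote list are hypothetical.

```python
from typing import Sequence


def artifact_prominence(votes: Sequence[bool]) -> float:
    """Fraction of viewers who flagged the highlighted region as a noticeable artifact."""
    if not votes:
        raise ValueError("at least one viewer judgment is required")
    return sum(1 for v in votes if v) / len(votes)


def noticed_by_majority(votes: Sequence[bool], threshold: float = 0.5) -> bool:
    """True when strictly more than `threshold` of the viewers noticed the artifact."""
    return artifact_prominence(votes) > threshold


# Hypothetical judgments from 10 viewers for a single artifact mask.
votes = [True, True, False, True, False, False, True, True, True, False]
prominence = artifact_prominence(votes)  # 6 of 10 viewers flagged the region
```

Under this reading, a binary "artifact present" label corresponds to thresholding the prominence score, which is why a region can carry a binary label yet go unnoticed by most viewers.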

58. 【2605.14845】Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

链接：https://arxiv.org/abs/2605.14845

作者：Marta Robledo-Moreno,Ruben Vera-Rodriguez,Ruben Tolosana,Javier Ortega-Garcia

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：demonstrated strong capabilities, Recent advancements, tasks remains unexplored, Signature Verification Challenge, remains unexplored

备注： Accepted at the 14th International Workshop on Biometrics and Forensics

Abstract:Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
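
For readers unfamiliar with the Equal Error Rate reported above, the following is a minimal sketch of how an EER can be computed from verification scores. It assumes higher scores mean "more likely genuine" and uses a simple threshold sweep; the score lists are made up, and the paper's own scoring protocol based on token probabilities is not reproduced here.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Approximate the EER by sweeping a threshold over all observed scores.

    FAR: fraction of impostor (forgery) comparisons accepted (score >= threshold).
    FRR: fraction of genuine comparisons rejected (score < threshold).
    The EER is taken at the threshold where FAR and FRR are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical scores: one genuine and one forged comparison overlap.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

A lower EER means better discrimination; with perfectly separated scores the sweep finds a threshold where both error rates are zero.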

59. 【2605.14843】MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

链接：https://arxiv.org/abs/2605.14843

作者：Rahul Jain,Mayank Patel,Asim Unmesh,Karthik Ramani

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieved strong visual, strong visual fidelity, temporal coherence, geometric constraints, achieved strong

备注： Under Review

Abstract:Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.

60. 【2605.14842】Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

链接：https://arxiv.org/abs/2605.14842

作者：Mor Ventura,Roy Hirsch,Yonatan Bitton,Regev Cohen,Roi Reichart

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：image editing, abstract image editing, Humans naturally communicate, Abstract, current image editing

备注：

Abstract:Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

61. 【2605.14838】Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

链接：https://arxiv.org/abs/2605.14838

作者：Bolin Zhang,Chao Yang,Bin Jiang,Takahiro Komamizu,Ichiro Ide

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：moment semantically similar, Video Moment Retrieval, aiming to identify, video-level correspondences, weakly-supervised Video Moment

备注： 26 pages, 4 figures. Preprint version of the article published in International Journal of Machine Learning and Cybernetics

Abstract:This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
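
The idea of turning temporal proposals into Gaussian masks and merging them into one positive mask can be illustrated roughly as follows. This is a sketch under stated assumptions, not the MCMT implementation: the abstract does not specify the mask parameterization or the combination rule, so the `(center, width)` proposal format, the `width / 2` standard deviation, and the element-wise mean are all hypothetical choices.

```python
import math
from typing import List, Tuple


def gaussian_mask(center: float, width: float, num_clips: int) -> List[float]:
    """Soft temporal mask over clips for one proposal.

    `center` and `width` are normalized to [0, 1]; the standard deviation is
    assumed to be width / 2 (a hypothetical choice for illustration).
    """
    std = max(width / 2.0, 1e-6)
    positions = [(i + 0.5) / num_clips for i in range(num_clips)]  # clip midpoints
    return [math.exp(-((p - center) ** 2) / (2.0 * std ** 2)) for p in positions]


def combine_masks(masks: List[List[float]]) -> List[float]:
    """Merge several proposal masks by an element-wise mean (assumed rule)."""
    return [sum(values) / len(values) for values in zip(*masks)]


# Two hypothetical proposals over a video split into 8 clips.
proposals: List[Tuple[float, float]] = [(0.5, 0.25), (0.6, 0.3)]
masks = [gaussian_mask(c, w, num_clips=8) for c, w in proposals]
positive_mask = combine_masks(masks)
```

Clips with high values in `positive_mask` would be treated as most relevant to the query, while low-weight clips in the same video play the role of negatives.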

62. 【2605.14832】Learning Direct Control Policies with Flow Matching for Autonomous Driving

链接：https://arxiv.org/abs/2605.14832

作者：Marcello Ceresini,Federico Pirazzoli,Andrea Bertogalli,Lorenzo Cipelli,Filippo D'Addeo,Anthony Dell'Eva,Alessandro Paolo Capasso,Alberto Broggi

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：directly outputs actionable, Ordinary Differential Equations, outputs actionable control, actionable control trajectories, control trajectories defined

备注： 16 pages, 6 figures, 2 tables. Accepted at IEEE ITSC 2026

Abstract:We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of Ordinary Differential Equations (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real urban city streets, intersections and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at this https URL.
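
To make "a small number of ODE integration steps" concrete, here is a minimal sketch of sampling from a flow-matching model with fixed-step Euler integration. The learned, BEV-conditioned network is replaced by a toy velocity field that moves every sample toward a fixed target along a straight line; the function names, the 4-step budget, and the control values are assumptions for illustration only.

```python
from typing import Callable, List

Vector = List[float]


def integrate_flow(velocity: Callable[[Vector, float], Vector],
                   x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for the trained model: a field pointing straight at `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Hypothetical flattened control sequence: [accel_0, accel_1, curv_0, curv_1].
noise = [0.3, -1.2, 0.8, 0.1]
target = [1.0, 0.5, 0.02, -0.01]
controls = integrate_flow(make_toy_velocity(target), noise, num_steps=4)
```

Because each step costs one network evaluation, the step count directly bounds inference latency, which is what makes few-step integration attractive for closed-loop re-planning.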

63. 【2605.14821】HDRFace: Rethinking Face Restoration with High-Dimensional Representation

链接：https://arxiv.org/abs/2605.14821

作者：Zirui Wang,Xianhui Lin,Yi Dong,Bo Wei,Gangjian Zhang,Siteng Ma,Zebiao Zheng,Xing Liu,Hong Gu,Minjing Dong

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：severe information loss, ill-posed inverse problem, inverse problem due, information loss, remains an ill-posed

备注：

Abstract:Face restoration under complex degradations remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.

64. 【2605.14819】The Velocity Deficit: Initial Energy Injection for Flow Matching

链接：https://arxiv.org/abs/2605.14819

作者：Linze Li,Zong-Wei Hong,Shen Zhang,Bo Lin,Jinglun Li,Yao Tang,Jiajun Liang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：guarantees constant-velocity trajectories, theoretically guarantees constant-velocity, Matching theoretically guarantees, Flow Matching theoretically, Flow Matching

备注： Accepted by ICML2026

Abstract:While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
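
The headline numbers in this abstract can be checked with simple arithmetic. The sketch below assumes the 44.6% figure is a relative FID reduction and that the 5x speedup refers to the ratio of sampling steps (250 versus 50); both readings are inferred from the abstract rather than stated in it.

```python
# FID values and step counts as quoted in the abstract.
baseline_fid = 13.68       # FID before applying SSC
ssc_fid = 7.58             # FID after applying SSC (50 steps)
baseline_250_fid = 8.65    # 250-step baseline

# Relative FID improvement (lower FID is better).
relative_improvement = (baseline_fid - ssc_fid) / baseline_fid

# Assumed meaning of "5x speedup": ratio of sampling steps.
step_speedup = 250 / 50

print(f"relative FID improvement: {relative_improvement:.1%}")
print(f"step ratio: {step_speedup:.0f}x")
print(f"50-step SSC beats 250-step baseline: {ssc_fid < baseline_250_fid}")
```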

65. 【2605.14815】Probing into Camera Control of Video Models

链接：https://arxiv.org/abs/2605.14815

作者：Chen Hou,Christian Rupprecht

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：geometrically meaningful content, produce geometrically meaningful, camera control, visual observations, meaningful content

备注：

Abstract:Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.

66. 【2605.14808】SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track

链接：https://arxiv.org/abs/2605.14808

作者：Lukas Roming,Felix Lehnerer,Jonas V. Funk,Andreas Michel,Georg Maier,Thomas Längle,Jürgen Beyerer

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：highly relevant task, modern production environments, Visual anomaly detection, Visual anomaly, highly relevant

备注： Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track

Abstract:Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination, and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of $62.61\%$, $57.42\%$, and $54.35\%$ on the test public, test private, and test private mixed sets of MVTec AD 2, respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at this https URL.

67. 【2605.14799】Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

链接：https://arxiv.org/abs/2605.14799

作者：Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Xianxun Zhu,Abdenour Hadid

类目：Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)

关键词：Convolutional Neural Networks, Generative Adversarial Networks, Neural Networks, Adversarial Networks, Convolutional Neural

备注：

Abstract:In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

68. 【2605.14795】COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

链接：https://arxiv.org/abs/2605.14795

作者：Shukun Jia,Shiyu Hu,Yipei Wang,Ximeng Cheng,Yichao Cao,Xiaobo Lu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Referring Multi-Object Tracking, Referring Multi-Object, Multi-Object Tracking, fundamental structural contradiction, faces a fundamental

备注：

Abstract:Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.

69. 【2605.14787】Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

链接：https://arxiv.org/abs/2605.14787

作者：Matteo Attimonelli,Alessandro De Bellis,Aryo Pradipta Gema,Rohit Saxena,Monica Sekoyan,Wai-Chung Kwan,Claudio Pomo,Alessandro Suglia,Dietmar Jannach,Tommaso Di Noia,Pasquale Minervini

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Composed Image Retrieval, Composed Image, reference image, target image satisfying, textual modification

备注：

70. 【2605.14785】Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning

链接：https://arxiv.org/abs/2605.14785

作者：Alberto Tamajo,Srinandan Dasmahapatra,Rahman Attar

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Neural networks suffer, Neural networks, class-incremental learning, networks suffer, suffer from catastrophic

备注： 37 pages; 24 tables; 7 figures; submitted to a journal

Abstract:Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal--replaying a subset of past samples--is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient--capturing self-induced interference--emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.

71. 【2605.14781】MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection

链接：https://arxiv.org/abs/2605.14781

作者：Leon Davies,Qinggang Meng,Mohamad Saada,Baihua Li,Simon Sølvsten

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：projection-induced scale-depth ambiguity, object detection remains, detection remains challenging, object detection, scale-depth ambiguity

备注： 12 pages, 4 figures, 8 tables. Submitted to Pattern Recognition. Code and reproducibility material available at [this https URL](https://github.com/bigggs/MonoPRIO)

Abstract:Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at this https URL.

72. 【2605.14772】BioHuman: Learning Biomechanical Human Representations from Video

链接：https://arxiv.org/abs/2605.14772

作者：Yujun Huo,He Zhang,Chentao Song,Honglin Song,Zongyu Zuo,Tao Yu

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

关键词：injury risk assessment, risk assessment, injury risk, internal biomechanical states, motion

备注：

Abstract:Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

73. 【2605.14747】Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

链接：https://arxiv.org/abs/2605.14747

作者：Weimin Xiong,Shuhao Gu,Bowen Ye,Zihao Yue,Lei Li,Feifan Song,Sujian Li,Hao Tian

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：graphical user interface, multimodal large language, large language models, driven growing interest, generalization remains constrained

备注： Accepted at ICML 2026

74. 【2605.14742】EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

链接：https://arxiv.org/abs/2605.14742

作者：Yuejiao Su,Xinshen Zhang,Zhen Ye,Lei Yao,Lap-Pui Chau,Yi Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：embodied intelligent agents, large language models, existing multimodal large, multimodal large language, Understanding human

备注： Accepted at ICML 2026. Project page: [this https URL](https://github.com/yuggiehk/EARL)

Abstract:Understanding human--environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.

75. 【2605.14733】Video-Zero: Self-Evolution Video Understanding

Link: https://arxiv.org/abs/2605.14733

Authors: Ruixu Zhang, Deyi Ji, Lanyun Zhu, Xuanyi Liu, Yuxin Meng, Ruihang Chu, Yujiu Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intensive human annotation, improving reasoning models, human annotation, offers a promising, promising path

Comments:

Click to view abstract

Abstract: Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

76. 【2605.14731】UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

Link: https://arxiv.org/abs/2605.14731

Authors: Xiaoyu Zhan, Xinyu Fu, Chenghao Yang, Xiaohong Zhang, Dongjie Fu, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Hansung Kim, Yuanqi Li, Jie Guo, Yanwen Guo

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: expressive digital avatars, virtual production, Speech-driven gestures, interactive media, fundamental to expressive

Comments:

Click to view abstract

Abstract:Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

77. 【2605.14727】CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

Link: https://arxiv.org/abs/2605.14727

Authors: Pengcheng Fang, Hongli Chen, Yuxia Chen, Tengjiao Sun, Jiaxin Liu, Xiaohao Cai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual feature maps, model global interactions, Fourier transforms provide, based on Fourier, Fourier transforms

Comments:

Click to view abstract

Abstract:Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.
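To make the shared-basis idea concrete, here is a minimal one-axis sketch, assuming the operator at frequency k takes the form U diag(g_k) Uᵀ with one orthogonal matrix U shared by all frequencies and positive gains g_k specific to each. The abstract does not give the exact parameterization, so the function names, the naive DFT, and the toy inputs are all assumptions for illustration. The paper applies its operator separably along height and width; this sketch covers a single axis.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive discrete Fourier transform of a list of complex numbers."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(seq[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def shared_basis_mix(x, basis, gains):
    """Mix channels along one axis with U diag(g_k) U^T at each frequency k.

    x:     n positions, each a list of c channel values
    basis: c x c orthogonal matrix U (columns are shared channel directions)
    gains: n lists of c positive gains, one list per frequency
    """
    n, c = len(x), len(x[0])
    # Transform every channel to the frequency domain.
    spec = [dft([x[t][ch] for t in range(n)]) for ch in range(c)]
    mixed = [[0j] * n for _ in range(c)]
    for k in range(n):
        v = [spec[ch][k] for ch in range(c)]
        # Project onto the shared basis: coeffs = U^T v.
        coeffs = [sum(basis[i][j] * v[i] for i in range(c)) for j in range(c)]
        # Scale by this frequency's own positive gains.
        coeffs = [gains[k][j] * coeffs[j] for j in range(c)]
        # Map back to channel space: U coeffs.
        for i in range(c):
            mixed[i][k] = sum(basis[i][j] * coeffs[j] for j in range(c))
    back = [dft(mixed[ch], inverse=True) for ch in range(c)]
    # Keep the real part; a real output needs gains[k] == gains[(n - k) % n].
    return [[back[ch][t].real for ch in range(c)] for t in range(n)]


# Toy input: 4 positions, 2 channels, and a rotation matrix as the shared basis.
theta = 0.3
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [2.0, 2.0]]
unit_gains = [[1.0, 1.0] for _ in range(4)]
double_gains = [[2.0, 2.0] for _ in range(4)]
y_identity = shared_basis_mix(x, U, unit_gains)
y_doubled = shared_basis_mix(x, U, double_gains)
```

With unit gains the operator reduces to the identity, which is a convenient sanity check. Because every frequency reuses the same U, a given gain index refers to the same channel direction across the whole spectrum.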

78. 【2605.14717】Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

Link: https://arxiv.org/abs/2605.14717

Authors: Saqib Nazir, Ardhendu Behera

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: morphology remains challenging, bright-field morphology remains, Differential Phase Contrast, label-free Differential Phase, inferring molecular phenotypes

Comments: Accepted in 28th International Conference on Pattern Recognition (ICPR) 2026

Click to view abstract

Abstract:Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at this https URL.

79. 【2605.14716】AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Control

Link: https://arxiv.org/abs/2605.14716

Authors: Pengcheng Fang, Tengjiao Sun, Dongjie Fu, Xiaoyu Zhan, Yanwen Guo, Hansung Kim, Xiaohao Cai

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: planar trajectory samples, human motion authoring, Transition Masked Diffusion, root positions, planar trajectory

Comments:

Click to view abstract

Abstract:Sparse anchors provide a compact interface for human motion authoring: users specify a few root positions, planar trajectory samples, or body-point targets, while the system synthesizes the full-body motion that completes the under-specified intent. We present AnchorRoute, a sparse-anchor motion synthesis framework that uses anchors as a shared scaffold for both generation and refinement. Before generation, AnchorRoute converts sparse anchors into anchor-condition features and injects the resulting condition memory into a frozen Transition Masked Diffusion prior through AnchorKV and dual-context conditioning. This preserves the generation quality of the pretrained text-to-motion prior while learning sparse spatial control. After generation, the same anchors are evaluated as residuals: their timestamps define refinement intervals, and their residuals determine where correction should be concentrated. RouteSolver then refines the motion by projecting soft-token updates onto anchor-defined piecewise-affine interval bases. This couples generation-time anchor conditioning with residual-routed refinement under one anchor scaffold. AnchorRoute supports root-3D, planar-root, and body-point control within the same formulation. In benchmark evaluations, AnchorRoute outperforms prior sparse-control methods under the sparse keyjoint protocol and consistently improves anchor adherence across control families. The results show that the learned anchor-conditioned generator and RouteSolver refinement are complementary: the generator preserves text-motion quality, while RouteSolver provides a controllable path toward stronger anchor adherence.

80. 【2605.14712】IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Link: https://arxiv.org/abs/2605.14712

Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robot imitation data, human demonstrators act, Robot imitation, similar visual-language observations, task phases

Comments: Code can be found in [this https URL](https://github.com/ZGC-EmbodyAI/IntentVLA)

Click to view abstract

81. 【2605.14710】Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Link: https://arxiv.org/abs/2605.14710

Authors: Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated transformative potential, diverse data sources, integrating diverse data, demonstrated transformative, transformative potential

Comments: Corresponding author: Ting Xiao

Click to view abstract

Abstract: Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

82. 【2605.14709】Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

Link: https://arxiv.org/abs/2605.14709

Authors: Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate multimodal understanding, Recent unified models, models integrate multimodal, Recent unified, unified models integrate

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image task (X2I): the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we construct a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity among simple-to-complex instructions. The code is released at this https URL.

83. 【2605.14708】StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Link: https://arxiv.org/abs/2605.14708

Authors: Zeyu Chen, Fangmin Zhao, Yan Shu, Yichao Liu, Liu Yu, Yu Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces unique challenges, maintaining fine-grained style, generation faces unique, text, extracting precise text

Comments: This paper has been accepted to CVPR 2026

Click to view abstract

Abstract:Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

84. 【2605.14705】Towards Continuous Sign Language Conversation from Isolated Signs

Link: https://arxiv.org/abs/2605.14705

Authors: Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim, Junhyeok Kim, Seong Jae Hwang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: written language, systems still mediate, primary language, spoken or written, DHH

Comments:

Click to view abstract

Abstract:Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

85. 【2605.14704】SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Link: https://arxiv.org/abs/2605.14704

Authors: Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: real-world scenes, reside in regions, target objects, objects, Reasoning

Comments:

Click to view abstract

Abstract: In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

86. 【2605.14703】Generating HDR Video from SDR Video

Link: https://arxiv.org/abs/2605.14703

Authors: SaiKiran Tedla, Francesco Banterle, Trevor Canham, Karanpreet Raja, David B. Lindell, Kiriakos N. Kutulakos, Jiacheng Li, Feiran Li, Daisuke Iso

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dynamic range, standard dynamic range, legacy standard dynamic, dynamic range, upconverting legacy standard

Comments:

Click to view abstract

Abstract:The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: this http URL
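The Video Merging Model described above is learned, and the abstract does not specify its internals. For intuition about what merging exposure-bracketed linear frames involves, the sketch below uses a classical weighted average with a hat-shaped weight as a stand-in. It is not the paper's method, and the function names, exposure values, and pixel data are assumptions made for this example.

```python
def hat_weight(value, eps=1e-3):
    """Trust mid-range pixels most; near-black and near-clipped pixels least."""
    return max(eps, 1.0 - abs(2.0 * value - 1.0))


def merge_exposures(frames, exposures):
    """Merge linear frames (flat lists of values in [0, 1]) into relative radiance."""
    merged = []
    for pixel_values in zip(*frames):
        # Each exposure estimates radiance as value / exposure time.
        num = sum(hat_weight(v) * v / t for v, t in zip(pixel_values, exposures))
        den = sum(hat_weight(v) for v in pixel_values)
        merged.append(num / den)
    return merged


# Three hypothetical exposure times and a three-pixel "scene".
exposures = [0.25, 1.0, 4.0]
radiance = [0.05, 0.4, 2.0]
# Simulate linear captures that clip at 1.0.
frames = [[min(1.0, r * t) for r in radiance] for t in exposures]
hdr = merge_exposures(frames, exposures)
```

For video, a merge like this would run once per time step over the predicted bracket. The brightest pixel here is clipped in two of the three exposures, yet the merged value stays close to the true radiance because clipped samples receive almost no weight.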

87. 【2605.14696】EponaV2: Driving World Model with Comprehensive Future Reasoning

Link: https://arxiv.org/abs/2605.14696

Authors: Jiawei Xu, Zhizhou Zhong, Zhijian Shu, Mingkai Jia, Mingxiao Li, Jia-Wang Bian, Qian Zhang, Kaicheng Zhang, Jin Xie, Jian Yang, Wei Yin

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Data scaling plays, Data scaling, general intelligence, scaling plays, plays a pivotal

Comments:

Click to view abstract

Abstract: Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Due to the lack of sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.

88. 【2605.14689】Are Candidate Models Really Needed for Active Learning?

Link: https://arxiv.org/abs/2605.14689

Authors: Harshini Mridula Mohan, Maanya Manjunath, Vipul Arya, S.H. Shabbeer Basha, Nitin Cheekatla

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language processing, uncovering complex patterns, Convolutional Neural Networks, Active learning, Deep learning

Comments: Accepted for publication in Computer Vision and Image Understanding (CVIU)

Click to view abstract

Abstract: Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly in training Convolutional Neural Networks (CNNs) and transformers due to their larger number of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive because they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC demonstrated the best performance in most of our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
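The three confidence-based strategies can be sketched as a ranking over the unlabeled pool, taking "confidence" to mean the largest predicted class probability. That definition, the `switch_round` parameter, and the toy probabilities below are assumptions for illustration; the abstract does not state how the paper defines confidence or when HCLC switches from HC to LC.

```python
def confidence(probs):
    """Confidence of a prediction, taken here as its largest class probability."""
    return max(probs)


def select_samples(pool_probs, budget, strategy, round_idx=0, switch_round=3):
    """Return indices of `budget` unlabeled samples to annotate next.

    strategy: "HC" (most confident), "LC" (least confident), or "HCLC"
    (HC before `switch_round`, LC from then on; the switch point is hypothetical).
    """
    if strategy == "HCLC":
        strategy = "HC" if round_idx < switch_round else "LC"
    if strategy not in ("HC", "LC"):
        raise ValueError(f"unknown strategy: {strategy}")
    order = sorted(range(len(pool_probs)),
                   key=lambda i: confidence(pool_probs[i]),
                   reverse=(strategy == "HC"))
    return order[:budget]


# Predicted class probabilities for four unlabeled samples (made-up values).
pool = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.51, 0.49]]
lc_pick = select_samples(pool, budget=2, strategy="LC")
hc_pick = select_samples(pool, budget=2, strategy="HC")
early_pick = select_samples(pool, budget=2, strategy="HCLC", round_idx=0)
late_pick = select_samples(pool, budget=2, strategy="HCLC", round_idx=5)
```

In the setting the abstract describes, the probabilities would come from the model currently being trained, starting from random weights, and not from a separately trained candidate model.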

89. 【2605.14664】MiVE: Multiscale Vision-language features for reference-guided video Editing

Link: https://arxiv.org/abs/2605.14664

Authors: Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving original motion, Reference-guided video editing, image as inputs, requiring the model, reference image

Comments: ICML 2026

Click to view abstract

Abstract:Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

90. 【2605.14654】Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

Link: https://arxiv.org/abs/2605.14654

Authors: Tan Pan, Shuhao Mei, Yixuan Sun, Kaiyu Guo, Chen Jiang, Zhaorui Tan, Mengzhu Li, Limei Han, Xiang Zou, Yuan Cheng, Mahsa Baktashmotlagh

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Self-supervised pre-training methods, Self-supervised pre-training, imaging typically treat, learning representations, masked reconstruction

Comments: ICML 2026

Click to view abstract

Abstract:Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.

91. 【2605.14651】TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

Link: https://arxiv.org/abs/2605.14651

Authors: Omkar Oak, Rukmini Nazre, Rujuta Budke, Suraj Sawant

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: purpose remain limited, Urban vegetation monitoring, Temporal Remote-sensing Repository, vegetation monitoring plays, remain limited

Comments: Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London

Click to view abstract

Abstract:Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection as well as Semantic Change Detection. The proposed dataset and methods are available at this https URL.

92. 【2605.14645】Vision-Based Water Level and Flow Estimation

Link: https://arxiv.org/abs/2605.14645

Authors: ZhiXin Sun

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: reached significant maturity, river surface velocity, surface velocity estimation, vision-based methodologies, significant maturity

Comments:

Click to view abstract

Abstract:With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at this https URL

93. 【2605.14641】How to Evaluate and Refine your CAM

Link: https://arxiv.org/abs/2605.14641

Authors: Luca Domeniconi, Alessandra Stramiglio, Michele Lombardi, Samuele Salti

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Class attribution maps, provide local explanations, convolutional neural networks, Class attribution, provide local

Comments: Accepted at ICPR 2026

Click to view abstract

Abstract:Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

94. 【2605.14635】MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Link: https://arxiv.org/abs/2605.14635

Authors: Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan, Takashi Wada

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal large language, large language models, ability of multimodal, multimodal large, large language

Abstract:This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
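
The vote aggregation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `aggregate_votes`, the four-emotion label set, and the choice to normalize by the total number of votes are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_votes(annotations, emotions):
    """Turn per-annotator multi-label selections into a normalized distribution.

    annotations: list of lists, one list of selected emotions per annotator.
    emotions: the allowed label set (hypothetical here).
    Returns (distribution dict, dominant emotion or None).
    """
    counts = Counter()
    for selected in annotations:
        # An annotator may pick several emotions; count each at most once.
        for emotion in set(selected):
            if emotion in emotions:
                counts[emotion] += 1
    total = sum(counts.values())
    if total == 0:
        return {e: 0.0 for e in emotions}, None
    distribution = {e: counts[e] / total for e in emotions}
    dominant = max(emotions, key=lambda e: counts[e])
    return distribution, dominant


# Hypothetical label set and votes, for illustration only.
EMOTIONS = ["joy", "sadness", "anger", "fear"]
votes = [["joy"], ["joy", "fear"], ["sadness"], ["joy"]]
dist, top = aggregate_votes(votes, EMOTIONS)
print(dist, top)
```

A model could then be scored both on the dominant label and on how closely its predicted distribution matches `dist`, which mirrors the two evaluation settings named in the abstract.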

95. 【2605.14631】Action-Inspired Generative Models

Link: https://arxiv.org/abs/2605.14631

Authors: Eshwar R. A., Debnath Pal

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Action-Inspired Generative Models, introduce Action-Inspired Generative, dual-network generative framework, generative framework motivated, methods assign uniform

Comments: 11 pages, 5 figures, and 4 tables

Abstract:We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_\phi$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_\phi$'s guiding signal. Crucially, $V_\phi$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_\phi$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.

96. 【2605.14626】UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

Link: https://arxiv.org/abs/2605.14626

Authors: Ping Zhou, Haoyu Wang, Mengmeng Zheng, Lei Zhang, Wei Wei, Chen Ding, Fei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires strictly aligned, segmentation requires strictly, real-world scenarios, RGB-T semantic segmentation, requires strictly

Abstract:RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.

97. 【2605.14621】Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Link: https://arxiv.org/abs/2605.14621

Authors: Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large vision-language models, language priors dominate, priors dominate weak, Large vision-language, vision-language models

98. 【2605.14615】CalibAnyView: Beyond Single-View Camera Calibration in the Wild

Link: https://arxiv.org/abs/2605.14615

Authors: Boying Li, Cheng Zhang, Weirong Chen, Daniel Cremers, Ian Reid, Hamid Rezatofighi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: classical approaches rely, controlled acquisition setups, fundamental prerequisite, classical approaches, approaches rely

Comments: 44 pages, 25 figures

Abstract:Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

99. 【2605.14609】Deep Image Segmentation via Discriminant Feature Learning

Link: https://arxiv.org/abs/2605.14609

Authors: Adam Dawid Sztamborski, Raül Pérez-Gonzalo, Antonio Agudo

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurate image segmentation, Accurate image, segmentation remains challenging, remains challenging, generating sharp

Comments: Accepted to ICIP 2026

Abstract:Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class one, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
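
The classical discriminant principle this abstract refers to (large between-class variance, small within-class variance) can be sketched in a few lines of plain Python. This is a Fisher-style scatter ratio written for illustration, not the paper's DDA loss: the function name `discriminant_ratio`, the `eps` stabilizer, and the choice of a simple ratio are assumptions.

```python
def discriminant_ratio(features, labels, eps=1e-8):
    """Within-class scatter divided by between-class scatter.

    features: list of equal-length feature vectors (lists of floats).
    labels: one class label per feature vector.
    Lower values mean more compact, better separated classes.
    """
    n = len(features)
    dim = len(features[0])
    global_mean = [sum(f[d] for f in features) / n for d in range(dim)]

    # Group feature vectors by class label.
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)

    within = 0.0
    between = 0.0
    for members in by_class.values():
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        # Spread of samples around their own class mean.
        within += sum((f[d] - mean[d]) ** 2 for f in members for d in range(dim))
        # Spread of class means around the global mean, weighted by class size.
        between += len(members) * sum(
            (mean[d] - global_mean[d]) ** 2 for d in range(dim)
        )
    return within / (between + eps)


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
separated = discriminant_ratio(points, [0, 0, 1, 1])
mixed = discriminant_ratio(points, [0, 1, 0, 1])
print(separated, mixed)
```

Minimizing such a ratio alongside a standard segmentation loss is one way a training objective could encourage compact and separable features; how DDA actually combines the terms is not stated in the abstract.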

100. 【2605.14607】ViMU: Benchmarking Video Metaphorical Understanding

Link: https://arxiv.org/abs/2605.14607

Authors: Qi Li, Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: transmission of overt, overt content, content directly presented, video, social

Abstract:Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

101. 【2605.14606】MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

Link: https://arxiv.org/abs/2605.14606

Authors: Chunlei Shi, Cui Wu, Xiang Xu, Hao Li, Ni Fan, Xue Han, Yongchao Feng, Yufeng Zhu, Boyu Liu, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate precipitation nowcasting, operational decision-making, Accurate precipitation, essential for disaster, disaster mitigation

Comments: 9 pages, 7 figures

Abstract:Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba's linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba's sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.

102. 【2605.14601】Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach

Link: https://arxiv.org/abs/2605.14601

Authors: Kanglin Ning, Yiran Zhao, Wenrui Li, Shaoru Sun, Xingtao Wang, Xiaopeng Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Three-dimensional object detection, comprehensive scene understanding, Three-dimensional object, semantic Gaussian, Gaussian

Comments: Accepted by ICME 2026

Abstract:Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.

103. 【2605.14597】VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

Link: https://arxiv.org/abs/2605.14597

Authors: Chunlei Shi, Hao Li, Yufeng Zhu, Boyu Liu, Yongchao Feng, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Multimedia (cs.MM)

Keywords: task for meteorological, meteorological applications, applications but faces, chaotic property, faces challenges due

Comments: 5 pages, 2 figures

Abstract:Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision mamba state-space blocks realize multi-source data fusion, and predict the future echo global dynamics. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on coarse prediction and ground truth, and further reconstructs the residual via conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.

104. 【2605.14594】TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

Link: https://arxiv.org/abs/2605.14594

Authors: Bojun Xiong, Zoubin Bi, Xinghui Peng, Yunmu Wang, Junchen Deng, Jun Liang, Jing Li, Bowen Cai, Huan Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: video game industries, game industries, plays a crucial, crucial role, video game

Comments: Technical Report

Abstract:High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single image conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we propose a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multimodal large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input point clouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE's structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.

105. 【2605.14590】FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

Link: https://arxiv.org/abs/2605.14590

Authors: Fengyi Zhang, Junya Zhang, Wenzhuo Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: data-governance remains challenging, Robust whole-slide image, remains challenging due, strict data-governance remains, whole-slide image

Abstract:Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments--skewness and kurtosis--as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.
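
The higher-order moments named in this abstract, skewness and kurtosis, are standard statistics and can be computed per color channel as in the Python sketch below. It only shows what such a compact descriptor might look like: the function name `stain_moments`, the use of population moments, and reporting excess kurtosis are assumptions, and nothing here reflects how FedStain exchanges or aggregates the values.

```python
def stain_moments(values):
    """Mean, variance, skewness and excess kurtosis of one channel's intensities.

    values: non-empty list of numbers (e.g. pixel intensities of one channel).
    Uses population (biased) central moments.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # A constant channel has no shape information.
        return {"mean": mean, "var": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "var": m2,
        "skewness": m3 / m2 ** 1.5,       # asymmetry of the distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,   # tail weight relative to a Gaussian
    }


symmetric = stain_moments([1, 2, 3, 4, 5])
right_skewed = stain_moments([1, 1, 1, 1, 10])
print(symmetric, right_skewed)
```

Four numbers per channel are cheap to transmit and reveal no pixel-level data, which is consistent with the privacy and communication argument made in the abstract.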

106. 【2605.14581】A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Link: https://arxiv.org/abs/2605.14581

Authors: Ho Hung Lim, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: traditional RAG, Visual RAG, RAG, RAG has offered, offered an alternative

Comments: Accepted to Findings of ACL 2026

107. 【2605.14579】Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

Link: https://arxiv.org/abs/2605.14579

Authors: Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate medical image, large anatomical variability, robust delineation remains, delineation remains challenging, Accurate medical

Abstract:Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine-grained multi-scale decoding. To address these issues, we propose Med-DisSeg, a dispersion-driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med-DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine-grained structure segmentation. The Dispersive Loss enlarges inter-sample margins by treating in-batch hidden representations as negative pairs, producing well-dispersed and boundary-aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure-sensitive responses, while the decoder performs adaptive multi-scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state-of-the-art performance. Moreover, Med-DisSeg achieves competitive results on multi-organ CT segmentation, supporting its robustness and cross-task applicability.
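
The idea of treating in-batch hidden representations as negative pairs can be sketched as a repulsion-only, InfoNCE-style term. The Python below is an illustrative guess at that family of losses, not the formulation used in the paper: the function name `dispersive_loss`, the squared Euclidean distance, the temperature `tau`, and the exclusion of self-pairs are all assumptions.

```python
import math


def dispersive_loss(embeddings, tau=0.5):
    """Log of the mean pairwise similarity exp(-||z_i - z_j||^2 / tau).

    embeddings: list of at least two equal-length vectors from one batch.
    The value is 0 when all embeddings coincide and becomes more negative
    as the batch spreads out, so minimizing it pushes samples apart.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings to form negative pairs")
    terms = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip self-pairs (an assumption of this sketch)
            dist_sq = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            terms.append(math.exp(-dist_sq / tau))
    return math.log(sum(terms) / len(terms))


collapsed = dispersive_loss([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
spread = dispersive_loss([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(collapsed, spread)
```

Because the term needs no positive pairs or extra network, it adds little overhead when summed with a segmentation loss, which matches the "negligible overhead" claim in the abstract.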

108. 【2605.14569】Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

Link: https://arxiv.org/abs/2605.14569

Authors: Yujie Wei, Chenglong Ma, Jianxiong Gao, Chenhui Wang, Shiwei Zhang, Biao Gong, Shuai Tan, Hangjie Yuan, Hongming Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing dynamic visual, magnetic resonance imaging, dynamic visual experiences, functional magnetic resonance, Reconstructing dynamic

Comments: Accepted to CVPR 2026

Abstract:Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

109. 【2605.14566】SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

Link: https://arxiv.org/abs/2605.14566

Authors: Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: missing fine structures, yield poor generalization, segmentation remains challenging, Medical image segmentation, fine structures

Abstract:Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.

110. 【2605.14552】LiWi: Layering in the Wild

Link: https://arxiv.org/abs/2605.14552

Authors: Yu He, Fang Li, Haoyang Tong, Lichen Ma, Xinyuan Shan, Jingling Fu, Dong Chen, Luohang Liu, Junshi Huang, Yan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: graphic design domains, Recent advances, empowered impressive layered, design domains, natural image decomposition

Abstract:Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundary. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and degradation-restoration objective provides boundary-correction supervision by recovering clean foreground image from degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.

111. 【2605.14548】Local Spatiotemporal Convolutional Network for Robust Gait Recognition

Link: https://arxiv.org/abs/2605.14548

Authors: Xiaoyun Wang, Cunrong Li, Wu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: promising biometric technology, advantages including non-invasiveness, offers distinctive advantages, distinctive advantages including, unique walking patterns

Abstract:Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
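
The strip-based pooling idea in this abstract can be illustrated on a toy silhouette sequence. The Python sketch below collapses each frame along one spatial axis so that a T x H x W sequence becomes a T x H map (horizontal strips) and a T x W map (vertical strips), both of which a 2D convolution could then process with time as one of its axes. The function name `strip_pool` and the use of max pooling are assumptions; the abstract does not specify the pooling operator used by GBSP.

```python
def strip_pool(sequence):
    """Collapse each frame of a T x H x W sequence into strip descriptors.

    sequence: list of T frames, each a list of H rows with W values.
    Returns (horizontal, vertical):
      horizontal is T x H, one value per row (pooled over width);
      vertical is T x W, one value per column (pooled over height).
    """
    horizontal = [[max(row) for row in frame] for frame in sequence]
    vertical = [[max(col) for col in zip(*frame)] for frame in sequence]
    return horizontal, vertical


# Two toy 2 x 3 binary silhouette frames.
frames = [
    [[0, 1, 0],
     [1, 1, 0]],
    [[0, 0, 1],
     [0, 0, 0]],
]
h_strips, v_strips = strip_pool(frames)
print(h_strips)
print(v_strips)
```

Each output is a 2D array indexed by time and one spatial axis, which is what lets an ordinary 2D kernel mix temporal and spatial information in the two branches.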

112. 【2605.14534】PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

Link: https://arxiv.org/abs/2605.14534

Authors: Fuhao Li, Shaofeng You, Jiagao Hu, Yu Liu, Yuxuan Chen, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: Evaluating object removal, Evaluating object, metrics frequently disagree, task is inherently, frequently disagree

Comments: Project Page: [this https URL](https://xiaomi-research.github.io/prove/)

Abstract:Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for the RC metrics and PROVE-Bench is publicly available at: this https URL.

113. 【2605.14530】Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Link: https://arxiv.org/abs/2605.14530

Authors: Sujung Hong, Chanyong Yoon, Seongjae Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large diffusion vision-language, diffusion vision-language models, Large diffusion, enabling parallel decoding, vision-language models

Comments:

Abstract:Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

114. 【2605.14525】From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

Link: https://arxiv.org/abs/2605.14525

Authors: Ling Li, Changjie Chen, Yuyan Wang, Jiaqing Lyu, Kenglun Chang, Yiyun Chen, Zhidong Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human pose estimation, models typically rely, specific moment, typically rely, human pose

Comments:

Abstract:In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input scheme for 3D human pose estimation: sparse interleaved input. This method leverages images captured from different camera views at different time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages. First, it can theoretically increase the output pose frame rate by N times with N cameras, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the output. Second, by using a sparse subset of available frames, our method can reduce data redundancy and simultaneously achieve better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: this https URL
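
The claimed N-fold increase in frame rate follows from simple timing arithmetic, sketched below. The sketch assumes the N cameras share one frame period and are staggered evenly, with camera i delayed by i/N of a period. The abstract does not state the actual offsets, and `interleaved_timestamps` is a hypothetical helper.

```python
from fractions import Fraction


def interleaved_timestamps(num_cameras, frame_period, frames_per_camera):
    """Return sorted (time, camera) pairs for evenly staggered cameras.

    Camera i is delayed by i * frame_period / num_cameras (an assumption).
    """
    period = Fraction(frame_period)
    offset = period / num_cameras
    times = [
        (k * period + i * offset, i)
        for k in range(frames_per_camera)
        for i in range(num_cameras)
    ]
    return sorted(times)


# Four cameras at 25 fps each (period 1/25 s), two frames per camera.
schedule = interleaved_timestamps(4, Fraction(1, 25), 2)
# Distinct gaps between consecutive captures in the merged stream.
gaps = {later[0] - earlier[0] for earlier, later in zip(schedule, schedule[1:])}
```

Under these assumptions the merged stream has a single gap of 1/100 s, i.e. 100 captures per second from four 25 fps cameras.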

115. 【2605.14518】ArcGate: Adaptive Arctangent Gated Activation

Link: https://arxiv.org/abs/2605.14518

Authors: Avik Bhattacharya, Siddhant Dnyanesh Gole, Subhasis Chaudhuri, Alejandro C. Frery, Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Arctangent Gated Activation, Adaptive Arctangent Gated, central to deep, Arctangent Gated, influencing non-linearity

Comments:

Abstract:Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.

116. 【2605.14513】HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Link: https://arxiv.org/abs/2605.14513

Authors: Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji, Fei Chao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Diffusion-based video generation, deployment remains limited, practical deployment remains, Training-free sparse attention, Diffusion-based video

Comments:

Abstract:Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
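
To make the top-$p$ selection concrete, here is a minimal sketch of building a sparse attention mask for one query: keep the fewest keys whose softmax probabilities sum to at least p. It is a generic illustration of top-p masking, not the HASTE calibration procedure; the scores and thresholds are made up, and `top_p_mask` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_mask(scores, p):
    """Keep the fewest keys whose attention probabilities sum to >= p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [i in keep for i in range(len(probs))]


# A peaked head and a flat head (illustrative scores only).
peaked = top_p_mask([4.0, 0.0, 0.0, 0.0], 0.9)
flat_shared = top_p_mask([0.0, 0.0, 0.0, 0.0], 0.9)
flat_own = top_p_mask([0.0, 0.0, 0.0, 0.0], 0.5)
```

With a shared threshold of 0.9 the peaked head keeps one key while the flat head keeps all four, so it gains no sparsity. Giving the flat head its own lower threshold (0.5 here, chosen by hand) halves its work, which is the kind of head-level heterogeneity the abstract refers to.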

117. 【2605.14487】Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Link: https://arxiv.org/abs/2605.14487

Authors: Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Autoregressive video diffusion, video diffusion models, Autoregressive video, models support real-time, diffusion models support

Comments:

Abstract:Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: this https URL.

118. 【2605.14486】Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

Link: https://arxiv.org/abs/2605.14486

Authors: Yiheng Li, Yang Yang, Zichang Tan, Gao Li, Zhen Lei, Wenhao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: AI-generated images grows, urgently needed, misuse of AI-generated, VAE and DDIM, generalizable image detection

Comments: preprint

Abstract:As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as those of GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Code is released at: this https URL.

119. 【2605.14475】GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Link: https://arxiv.org/abs/2605.14475

Authors: Jiashun Zhu, Ronghao Fu, Jiasen Hu, Nachuan Xing, Xu Na, Xiao Yang, Zhiwen Lin, Weipeng Zhang, Lang Sun, Zhiheng Xue, Haoran Liu, Weijie Zhang, Bo Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sensing images requires, images requires models, remote sensing images, tiny visual evidence, UHR remote sensing

Comments:

Abstract:Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at this https URL

120. 【2605.14462】Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

Link: https://arxiv.org/abs/2605.14462

Authors: Yubo Zhao, Yujin Chai, Yunao Dong, Chengfeng Zhao, Zijiao Zeng, Yuan Liu, Chi-Keung Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: content creation, simulation-based learning, HOI, monocular HOI, Recovering

Comments:

Abstract:Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce HA-HOI, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a human-first, object-follow formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, HA-HOI improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: this https URL

121. 【2605.14461】ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

Link: https://arxiv.org/abs/2605.14461

Authors: Ledun Zhang, Yatu Ji, Xufei Zhuang, Xinying Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing object removal, making precise removal, precise removal difficult, unnatural background completion, Existing object

Comments: 5 pages, 4 figures. Open-source software paper

Abstract:Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at this https URL under the Apache-2.0 license.

122. 【2605.14448】Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Link: https://arxiv.org/abs/2605.14448

Authors: Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Multimodal large language, large language models, large language, Multimodal large, reasoning

Comments: 30 pages, preprint

123. 【2605.14417】Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Link: https://arxiv.org/abs/2605.14417

Authors: Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: future physical transitions, whole-body control requires, requires control representations, physical transitions, Natural language

Comments:

Abstract:Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

124. 【2605.14406】GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

Link: https://arxiv.org/abs/2605.14406

Authors: Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth observation imagery, Large-scale pretraining, pretraining on Earth, Earth observation, yielded powerful representations

Comments:

Abstract:Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.

125. 【2605.14403】DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Link: https://arxiv.org/abs/2605.14403

Authors: Yize Liu, Siyuan Yan, Ming Hu, Lie Ju, Xieji Li, Feilong Tang, Wei Feng, Zongyuan Ge

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dermatological diagnosis requires, Large Language Models, Multimodal Large Language, expert clinical knowledge, diagnosis requires integrating

Comments: MICCAI2026 early acceptance

Abstract:Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at this https URL.

126. 【2605.14399】SceneForge: Structured World Supervision from 3D Interventions

Link: https://arxiv.org/abs/2605.14399

Authors: Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: tasks require supervision, learning tasks require, tasks require, supervision, multimodal learning tasks

Comments:

Abstract:Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.

127. 【2605.14396】Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

Link: https://arxiv.org/abs/2605.14396

Authors: Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Jonathan Petit, Z. Berkay Celik, Ryan Gerdes, Ming F. Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Autonomous vehicles depend, govern motion planning, directly govern motion, Autonomous vehicles, perceive lane boundaries

Comments:

Abstract:Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings, safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes neighboring the ground truth that keep the same road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80–84% of the time (vs. 97–99% for clean nuScenes), while AdvPatch passes only 0–9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

128. 【2605.14393】Analogical Trajectory Transfer

Link: https://arxiv.org/abs/2605.14393

Authors: Junho Kim, Eun Sun Lee, Gwangtak Bae, Seunggu Kang, Young Min Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically analogous location, translate motion trajectories, study analogical trajectory, analogous location, study analogical

Comments:

Abstract:We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to involves a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method requires no training, runs in around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, indicating the broad applicability of our work to AR/VR and robotics.

129. 【2605.14391】Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

Link: https://arxiv.org/abs/2605.14391

Authors: Qi Mao, Zijian Wang, Zhengxue Cheng, Lingyu Zhu, Siwei Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learned image compression, Learned image, increasingly requires reconstructions, image compression, increasingly requires

Comments:

Abstract:Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.
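
The two latent paradigms contrasted in this abstract can be illustrated with textbook quantizers. The sketch below is generic and is not MoDE's codec: the step size, the codebook, and both helper names are made up for illustration.

```python
def scalar_quantize(latent, step):
    """Scalar quantization: round each component to a multiple of `step`."""
    return [round(x / step) * step for x in latent]


def vector_quantize(latent, codebook):
    """Vector quantization: replace the whole vector by its nearest codeword.

    Returns (index, codeword); only the index would need to be transmitted.
    """
    def squared_distance(codeword):
        return sum((a - b) ** 2 for a, b in zip(latent, codeword))

    index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]


latent = [0.4, 1.6]
sq = scalar_quantize(latent, step=0.5)
vq_index, vq_codeword = vector_quantize(
    latent, codebook=[[0.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
)
```

Shrinking `step` refines the scalar-quantized values at the cost of more bits, which is the rate-scalable behaviour attributed to SQ latents. The VQ output is limited to the fixed codebook, so it is compact but cannot be refined the same way.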

130. 【2605.14382】Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Link: https://arxiv.org/abs/2605.14382

Authors: Yuheng Wu,Xiangbo Gao,Tianhao Chen,Xinghao Chen,Qing Yin,Zhengzhong Tu,Dongman Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)

Keywords: Interactive real-time autoregressive, Interactive real-time, real-time autoregressive video, dynamically evolving event, world modeling

Comments:

Abstract:Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

131. 【2605.14346】Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

Link: https://arxiv.org/abs/2605.14346

Authors: Yuanhang Yao,Ping Qian,Zhu Liu,Long Ma,Weimin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Single-frame Infrared Small, Infrared Small Target, Small Target Detection, localize weak targets, Single-frame Infrared

Comments:

Abstract:Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at this https URL.

132. 【2605.14341】AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

Link: https://arxiv.org/abs/2605.14341

Authors: Zuopeng Zhao,Ying Liu,Xiaoyu Li,Su Luo,Lu Li,Wenwen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, generating realistic images, Existing diffusion models, Existing diffusion, realistic images

Comments:

Abstract:Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.

133. 【2605.14337】IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

Link: https://arxiv.org/abs/2605.14337

Authors: Yifan Chen,Fei Yin,Chunle Guo,Chongyi Li,Yujiu Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: perceive their surroundings, challenging for individuals, individuals and machines, machines to perceive, nighttime circumstances

Comments: Accepted by CGI-2025

Abstract:In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that encapsulates the coexistence of low-light situations and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we propose an integration of an illumination-guided module embedded in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while contending with the adversities posed by various degradations in low-light scenarios.

134. 【2605.14333】InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Link: https://arxiv.org/abs/2605.14333

Authors: Yang Yue,Fangyun Wei,Tianyu He,Jinjing Zhao,Zanlin Ni,Zeyu Liu,Jiayi Guo,Lei Shi,Yue Dong,Li Chen,Ji Li,Gao Huang,Dong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: practically important patterns, autoregressive generators built, perceptually salient, salient and practically, practically important

Comments: Code and checkpoints are available at [this https URL](https://github.com/LeapLabTHU/InsightTok)

Abstract:Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.

135. 【2605.14326】D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

Link: https://arxiv.org/abs/2605.14326

Authors: Zuopeng Zhao,Ying Liu,Kanyaphakphachsorn Pharksuwan,Su Luo,Xiaoyu Li,Maocai Ning

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing image, sensing image generation, Remote sensing, sensing image, remote sensing models

Comments:

Abstract:Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.

136. 【2605.14315】TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

Link: https://arxiv.org/abs/2605.14315

Authors: David Huang,Guile Wu,Chengjie Huang,Bingbing Liu,Dongfeng Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single forward pass, Recent feed-forward, traditional per-scene optimization, per-scene optimization paradigm, enabling effective multi-view

Comments: Technical Report

Abstract:Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability and efficiency. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: this https URL.

137. 【2605.14310】CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Link: https://arxiv.org/abs/2605.14310

Authors: Ailar Mahdizadeh,Puria Azadi,Muchen Li,Xiangteng He,Leonid Sigal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision-language models, support future reasoning, ever-growing visual history, vision-language models, requires a compact

Comments:

Abstract:Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
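
To make the coreset view of cache compression concrete, here is a minimal sketch of greedy k-center (farthest-point) selection, a classic coverage-oriented coreset heuristic. It is a stand-in for illustration only and is not the paper's bicriteria objective or its orthogonality-driven criterion; the function name `greedy_k_center`, the toy 2-D "tokens", and the budget are all assumptions.

```python
import math

def greedy_k_center(points, budget):
    """Farthest-point selection: return indices of up to `budget` points
    chosen so that every point lies close to some selected point."""
    if budget <= 0 or not points:
        return []
    selected = [0]  # arbitrary starting point (an assumption of this sketch)
    # Distance from each point to its nearest selected point so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < min(budget, len(points)):
        # Pick the point that is currently worst covered.
        idx = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return selected

# Hypothetical cached token embeddings: two tight clusters plus one outlier.
tokens = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
kept = greedy_k_center(tokens, 3)
```

With a budget of three, the selection keeps one representative per region instead of two near-duplicates, which is the intuition behind preferring coverage over independent token-wise scores.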

138. 【2605.14309】ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Link: https://arxiv.org/abs/2605.14309

Authors: Shen Lin,Jing Lin,Junhao Dong,Piotr Koniusz,Li Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: precisely remove target, instance level, making it difficult, affecting unrelated semantics, Machine unlearning

Comments:

Abstract:Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

139. 【2605.14291】To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Link: https://arxiv.org/abs/2605.14291

Authors: Chengshuai Zhao,Zhen Tan,Dawei Li,Zhiyuan Yu,Huan Liu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: posing severe copyright, Large Vision-Language Models, advancement of Large, Large Vision-Language, multimodal web data

Comments:

140. 【2605.14278】KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Link: https://arxiv.org/abs/2605.14278

Authors: Ruicheng Zhang,Kaixi Cong,Jun Zhou,Zhizhou Zhong,Zunnan Xu,Shuiyang Mao,Wei Liu,Xiu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Aligning streaming autoregressive, preferences is challenging, human preferences, aligning streaming video, Aligning streaming

Comments:

Abstract:Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
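
For readers unfamiliar with GRPO, the "group relative" part can be sketched in a few lines: rewards for a group of sampled branches are normalized by the group's own mean and standard deviation. This shows only the generic GRPO-style advantage computation; whether KVPO uses exactly this normalization is an assumption, and the reward values below are made up.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four generation branches from the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Branches that score above the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.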

141. 【2605.14274】CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Link: https://arxiv.org/abs/2605.14274

Authors: Zhenyang Ni,Yijiang Li,Ruochen Jiao,Simon Sinong Zhan,Sipeng Chen,Zhenfei Yin,Minshuo Chen,Philip Torr,Zhaoran Wang,Qi Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce visually plausible, visually plausible rollouts, violate physical constraints, Video generation models, generation models trained

Comments:

Abstract:Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

142. 【2605.14270】Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

Link: https://arxiv.org/abs/2605.14270

Authors: Kanghyun Baek,Jaihyun Lew,Chaehun Shin,Jungbeom Lee,Sungroh Yoon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Diffusion Transformers, Multimodal Diffusion, Diffusion Transformers, achieved remarkable progress, generated image

Comments: Accepted to ICML 2026

Abstract:Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic `omission signal' representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.

143. 【2605.14269】PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Link: https://arxiv.org/abs/2605.14269

Authors: Yidong Huang,Zun Wang,Han Lin,Dong-Ki Kim,Shayegan Omidshafiei,Jaehong Yoon,Jaemin Cho,Yue Zhang,Mohit Bansal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generating realistic human, Generating realistic, motion, central yet unsolved, unsolved challenge

Comments: First two authors contributed equally, website: [this https URL](https://phy-motion.github.io/)

Abstract:Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

144. 【2605.14267】Image Restoration via Diffusion Models with Dynamic Resolution

Link: https://arxiv.org/abs/2605.14267

Authors: Yang Zheng,Wen Li,Zhaoqiang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: exhibited remarkable efficacy, Diffusion models, exhibited remarkable, remarkable efficacy, Diffusion

Comments: Accepted by ICML 2026

Abstract:Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at this https URL.

145. 【2605.14255】Architecture-Aware Explanation Auditing for Industrial Visual Inspection

Link: https://arxiv.org/abs/2605.14255

Authors: Sibo Jia,Zihang Zhao,Kunrong Li

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Industrial visual inspection, visual inspection systems, inspection systems increasingly, systems increasingly rely, Industrial visual

Comments:

Abstract:Industrial visual inspection systems increasingly rely on deep classifiers whose heatmap explanations may appear visually plausible while failing to identify the image regions that actually drive model decisions. This paper operationalizes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis: the perturbation-based faithfulness of an explanation method is bounded by its structural distance from the model's native decision mechanism. On WM-811K wafer maps (9 classes, 172k images) under a three-seed zero-fill perturbation protocol, ViT-Tiny + Attention Rollout attains Deletion AUC 0.211 against 0.432-0.525 for Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM (abs(Cohen's d) 1.1), despite lower classification accuracy. Swin-Tiny disentangles architecture family from readout structure: despite being a Transformer, its spatial feature-map hierarchy makes it Grad-CAM compatible, showing that the operative factor is readout structure rather than architecture family. A model-agnostic control (RISE) compresses all families to Deletion AUC about 0.1, indicating the gap arises from the explainer pathway; notably, RISE outperforms all native methods, so native readout is a compatibility principle rather than an optimality guarantee. A blur-fill sensitivity analysis shows that the family ordering reverses under a different perturbation baseline, reinforcing that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples. An exploratory boundary-condition study on MVTec AD (pretrained models) indicates that audit results are dataset/task dependent and identifies conditions requiring qualification. The protocol yields actionable guidance: explanation pathways should be co-designed with model architectures based on readout structure, and deployed heatmaps should be accompanied by quantitative faithfulness metrics.
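
The Deletion AUC figures quoted above come from a perturbation test in which the pixels an explanation ranks highest are removed first and the model's confidence is tracked as they disappear. A minimal sketch of that idea with zero-fill follows; the flat four-pixel "image", the toy model, and the function name `deletion_auc` are illustrative assumptions, not the paper's protocol.

```python
def deletion_auc(image, saliency, model, fill=0.0):
    """Area under the confidence curve as pixels are deleted, most salient first.
    A lower value means the explanation found pixels the model relies on."""
    order = sorted(range(len(image)), key=lambda i: saliency[i], reverse=True)
    perturbed = list(image)
    scores = [model(perturbed)]  # confidence before any deletion
    for i in order:
        perturbed[i] = fill  # zero-fill perturbation
        scores.append(model(perturbed))
    # Trapezoidal rule over the fraction of pixels deleted (x-axis from 0 to 1).
    step = 1.0 / len(image)
    return sum((a + b) / 2 * step for a, b in zip(scores, scores[1:]))

# Toy "model": confidence depends only on the first two pixels.
def toy_model(x):
    return (x[0] + x[1]) / 2

image = [1.0, 1.0, 0.3, 0.7]
faithful = deletion_auc(image, [0.9, 0.8, 0.1, 0.0], toy_model)    # ranks the right pixels first
unfaithful = deletion_auc(image, [0.0, 0.1, 0.8, 0.9], toy_model)  # ranks irrelevant pixels first
```

In this toy case the faithful heatmap scores 0.25 and the unfaithful one 0.75. Changing `fill` changes the curve, which mirrors the abstract's point that rankings depend on the perturbation operator as well as on the model and explainer.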

146. 【2605.14253】Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Link: https://arxiv.org/abs/2605.14253

Authors: Harry Robertshaw,Yanghe Hao,Weiyuan Deng,Benjamin Jackson,S.M.Hadi Sadati,Nikola Fischer,Tom Vercauteren,Alejandro Granados,Thomas C. Booth

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: improves stroke outcomes, local treatment access, Mechanical thrombectomy, improves stroke, stroke outcomes

Comments: Harry Robertshaw and Yanghe Hao contributed equally to this work. Published in the International Journal of Computer Assisted Radiology and Surgery

Abstract:Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

147. 【2605.14251】Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

Link: https://arxiv.org/abs/2605.14251

Authors: Aarushi Kulkarni,Alarice Lowe,Pratik Shah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital pathology whole-slide, Conditional generative adversarial, generative adversarial networks, pathology whole-slide images, enabled high-fidelity computational

Comments:

Abstract:Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (HE) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women's Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for HE-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. HE restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational HE staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.
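
Two of the three metrics reported above, PCC and PSNR, are simple enough to compute by hand. The sketch below uses their standard definitions on flat lists of pixel intensities; the sample values and the 8-bit peak of 255 are assumptions, and SSIM is left out because it needs windowed local statistics.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return float("inf") if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

# Hypothetical pixel intensities: a reference patch and a generated patch.
reference = [10, 50, 90, 130, 170, 210]
generated = [12, 48, 95, 128, 165, 214]
pcc = pearson(reference, generated)
quality_db = psnr(reference, generated)
```

Both metrics compare pixels at the same coordinates, so any misalignment between image pairs lowers them regardless of how good the generated stain is. That is why the abstract describes scores on unregistered slides as conservative lower bounds.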

148. 【2605.14239】Implicit spatial-frequency fusion of hyperspectral and lidar data via kolmogorov-arnold networks

链接：https://arxiv.org/abs/2605.14239

作者：Zekun Long,Judy X. Yang,Jing Wang,Ali Zia,Guanyiman Fu,Jun Zhou

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：complex scenes due, classification is challenging, geometric structures, challenging in complex, complex scenes

备注： 6 pages, 1 figure, conference

点击查看摘要

Abstract:Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.
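The three scores named at the end of the abstract (overall accuracy, average accuracy, Cohen's Kappa) can all be derived from a confusion matrix. The sketch below uses the standard definitions; the 2x2 matrix is a made-up toy example, not a result from the paper.

```python
def classification_scores(confusion):
    """Return (overall accuracy, average accuracy, Cohen's kappa).

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    overall = correct / total
    # Average accuracy: mean of the per-class recalls.
    average = sum(
        confusion[i][i] / sum(confusion[i]) for i in range(n_classes)
    ) / n_classes
    # Chance agreement from the row (true) and column (predicted) marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa

# Toy confusion matrix (made up for illustration).
toy = [[50, 10],
       [5, 35]]
oa, aa, kappa = classification_scores(toy)
```

Average accuracy weights every class equally, so it differs from overall accuracy whenever the classes are imbalanced, as land-cover classes in hyperspectral benchmarks typically are.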

149. 【2605.14221】Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

链接：https://arxiv.org/abs/2605.14221

作者：Ahmed Rekik,R. Jarrett Rushmore,Sylvain Bouix,Linda Marrakchi-Kacem

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：magnetic resonance imaging, reliable neuroimaging analysis, yield anatomically inconsistent, voxel-wise deep models, anatomically inconsistent results

备注： 7 pages, 5 figures. Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

点击查看摘要

Abstract:Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard-Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.

150. 【2605.14201】MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

链接：https://arxiv.org/abs/2605.14201

作者：Rajeev Yasarla,Deepti Hegde,Hsin-Pai Cheng,Shizhong Han,Yunxiao Shi,Meysam Sadeghigooghari,Hanno Ackermann,Litian Liu,Pranav Desai,Fatih Porikli,Mohammad Ghavamzadeh,Hong Cai

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：traditional imitation learning, closed-loop settings due, motion planners, brittle when evaluated, settings due

备注： 19 pages, 9 figures, NeurIPS 2026 submission

点击查看摘要

Abstract:Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

151. 【2605.14191】CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

链接：https://arxiv.org/abs/2605.14191

作者：Zhuojin Li,Hsin-Pai Cheng,Hong Cai,Shizhong Han,Fatih Porikli

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：deliver remarkable image, high computational cost, incur high computational, Diffusion Transformers, deliver remarkable

备注： 8 pages, 8 figures, CVPR workshop

点击查看摘要

Abstract:Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high-coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-$\alpha$ and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.
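The abstract does not define its spatial coherence score, so the sketch below substitutes a simple stand-in: the mean cosine similarity between each token and its 4-connected neighbours on the latent lattice. The functions `coherence_scores` and `tokens_to_skip` are hypothetical names, and the whole block only illustrates the general idea of scoring local redundancy in one linear pass and marking the most redundant tokens as candidates for skipping.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coherence_scores(grid):
    """Mean cosine similarity of each token to its 4-connected neighbours.

    grid[r][c] is a token embedding. Each token looks at a constant number
    of neighbours, so the cost is linear in the number of tokens.
    (Stand-in score; the paper's exact definition is not given in the abstract.)
    """
    rows, cols = len(grid), len(grid[0])
    scores = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sims = [
                cosine(grid[r][c], grid[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            scores[r][c] = sum(sims) / len(sims) if sims else 0.0
    return scores

def tokens_to_skip(scores, count):
    """Positions of the `count` tokens with the highest coherence score."""
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    flat.sort(reverse=True)
    return {(r, c) for _, r, c in flat[:count]}

# Toy 2x3 lattice: the left two columns are identical, the right column differs.
grid = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
scores = coherence_scores(grid)
skipped = tokens_to_skip(scores, count=2)
```

In this toy lattice the two left-most tokens are surrounded only by identical neighbours, so they receive the highest score and are the first candidates for skipping.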

152. 【2605.14166】You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

链接：https://arxiv.org/abs/2605.14166

作者：Riccardo Carraro,Anna Briotto,Endi Hysa,Marco Fiorucci,Lamberto Ballan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Face image super-resolution, image super-resolution aims, Face image, recover high-resolution facial, super-resolution aims

备注：

点击查看摘要

Abstract:Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
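The heatmap-guided loss can be pictured as a per-pixel reconstruction error multiplied by a spatial weight. The sketch below is one possible form, assuming an L1 error and weights of `1 + alpha * heatmap`; both choices and the `alpha` parameter are assumptions made for illustration, since the abstract only states that heatmaps are converted into spatial weights.

```python
def heatmap_weighted_l1(pred, target, heatmap, alpha=1.0):
    """Weighted mean absolute error with per-pixel weights 1 + alpha * heatmap.

    pred, target, heatmap: 2D lists of equal shape; heatmap values in [0, 1].
    The weighting form is an assumption, not the paper's stated formula.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p_row, t_row, h_row in zip(pred, target, heatmap):
        for p, t, h in zip(p_row, t_row, h_row):
            w = 1.0 + alpha * h
            weighted_sum += w * abs(p - t)
            weight_total += w
    return weighted_sum / weight_total

# Toy 2x2 example: the only error sits on a pixel the heatmap marks as important.
pred = [[0.5, 0.5], [0.5, 0.5]]
target = [[0.5, 0.9], [0.5, 0.5]]
landmark_heatmap = [[0.0, 1.0], [0.0, 0.0]]
flat_heatmap = [[0.0, 0.0], [0.0, 0.0]]

weighted = heatmap_weighted_l1(pred, target, landmark_heatmap)
unweighted = heatmap_weighted_l1(pred, target, flat_heatmap)
```

With the landmark heatmap the same pixel error contributes more to the loss than with a flat heatmap, which is the emphasis on semantically important regions that the abstract describes.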

153. 【2605.14145】Rethinking the Good Enough Embedding for Easy Few-Shot Learning

链接：https://arxiv.org/abs/2605.14145

作者：Michael Karnes,Alper Yilmaz

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：deep visual recognition, Platonic Representation Hypothesis, field of deep, deep visual, visual recognition

备注：

点击查看摘要

Abstract:The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: is a "good embedding" all you need? In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. By utilizing a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify an optimal feature extraction layer. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
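A k-Nearest Neighbor classifier over frozen embeddings needs no gradient updates at all, which is the core of the pipeline described above. The sketch below assumes cosine similarity and majority voting, and uses tiny 2D vectors as stand-ins for DINOv2-L features; the PCA/ICA refinement step is omitted, and none of these choices are confirmed details of the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(query, support, k=1):
    """Label a query embedding by majority vote among its k nearest supports.

    support: list of (embedding, label) pairs from the few-shot support set.
    """
    ranked = sorted(support, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-way 2-shot episode with made-up 2D "embeddings".
support_set = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.1, 0.9], "dog"),
]
label_a = knn_predict([0.8, 0.2], support_set, k=3)
label_b = knn_predict([0.2, 0.8], support_set, k=3)
```

Since the backbone stays frozen, adapting to a new few-shot episode only means swapping the support set.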

154. 【2605.14136】DiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

链接：https://arxiv.org/abs/2605.14136

作者：Nurislam Tursynbek,Zhiqiang Lao,Heather Yu,Gedas Bertasius,Marc Niethammer

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：visually compelling frames, transformers generate visually, generate visually compelling, diffusion transformers generate, compelling frames

备注： CVPR'26 Workshop on Agentic AI for Visual Media

点击查看摘要

Abstract:Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

155. 【2605.14135】PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

链接：https://arxiv.org/abs/2605.14135

作者：Adil Qureshi,Dongki Jung,Jaehoon Choi,Dinesh Manocha

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：high-fidelity sparse-view indoor, reconstructs closed room, closed room geometry, approach for high-fidelity, high-fidelity sparse-view

备注：

点击查看摘要

Abstract:We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

156. 【2605.14113】ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

链接：https://arxiv.org/abs/2605.14113

作者：Alvaro Lopez Pellicer,Plamen Angelov,Marwan Bukhari,Yi Li,Eduardo Soares,Jemma Kerns

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词：networks offer compelling, offer compelling case-based, compelling case-based reasoning, raw continuous outputs, continuous outputs lack

备注： CVR 2026

点击查看摘要

Abstract:While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, substantially outperforming standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
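The two privacy criteria named above have simple textbook definitions: $k$-anonymity requires every group of records sharing the same quasi-identifiers to contain at least $k$ records, and (distinct) $\ell$-diversity requires each such group to contain at least $\ell$ different sensitive values. The sketch below checks both on a list of records; the field names and the function `passes_privacy_gate` are hypothetical, and how ProtoMedAgent applies the gate inside its pipeline is not described in the abstract.

```python
from collections import defaultdict

def passes_privacy_gate(records, quasi_keys, sensitive_key, k, l):
    """True if every quasi-identifier group is k-anonymous and l-diverse.

    records: list of dicts; quasi_keys: fields treated as quasi-identifiers;
    sensitive_key: the field whose values must be diverse within each group.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_keys)].append(rec[sensitive_key])
    return all(
        len(values) >= k and len(set(values)) >= l
        for values in groups.values()
    )

# Toy, made-up records with hypothetical field names.
records = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "A"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "B"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "A"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "A"},
]
quasi = ["age_band", "sex"]
ok_k2_l1 = passes_privacy_gate(records, quasi, "diagnosis", k=2, l=1)
ok_k2_l2 = passes_privacy_gate(records, quasi, "diagnosis", k=2, l=2)
```

The second check fails because one group carries a single diagnosis value, so revealing group membership there would also reveal the diagnosis.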

157. 【2605.14110】SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

链接：https://arxiv.org/abs/2605.14110

作者：Sandro Papais,Lezhou Feng,Charles Cossette,Lingting Ge

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：enable strong multi-view, Vision Transformers, high inference latency, enable strong, strong multi-view

备注： Accepted to CVPR 2026

点击查看摘要

Abstract:Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

158. 【2605.14108】Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

链接：https://arxiv.org/abs/2605.14108

作者：Nishi Doshi,Shrey Shah

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Diabetic Retinopathy, rural regions, regions often lack, lack the specialists, specialists and infrastructure

备注：

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
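The routing logic of the cascade and the cloud-call arithmetic can be sketched in a few lines. In the sketch below, `triage_score` and `cloud_grade` are hypothetical callables standing in for the edge and cloud models, and the count of 363 forwarded images is inferred from the reported 49.52% of 733 rather than stated in the abstract.

```python
def cascade_predict(image, triage_score, cloud_grade, threshold):
    """Two-tier routing: triage on the edge, grade in the cloud only if flagged.

    triage_score(image) -> estimated probability of referable DR (edge model).
    cloud_grade(image)  -> severity class number returned by the cloud model.
    Both callables are hypothetical stand-ins. Returns (label, used_cloud).
    """
    if triage_score(image) < threshold:
        return "Class 0-1", False          # stays on the clinic device
    return f"Class {cloud_grade(image)}", True

# Cloud-call arithmetic from the reported figures.
total_images = 733
forwarded = round(total_images * 0.4952)          # inferred image count
cloud_call_reduction = 1 - forwarded / total_images

# Toy usage with stub models (made-up scores, not real predictions).
stub_triage = lambda img: img["p_referable"]
stub_cloud = lambda img: img["grade"]
low_risk = cascade_predict({"p_referable": 0.1, "grade": 3}, stub_triage, stub_cloud, 0.3)
high_risk = cascade_predict({"p_referable": 0.8, "grade": 3}, stub_triage, stub_cloud, 0.3)
```

A lower triage threshold raises sensitivity at the cost of forwarding more images, which is the trade-off behind the validation-tuned high-sensitivity threshold mentioned in the abstract.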

159. 【2605.14104】DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

链接：https://arxiv.org/abs/2605.14104

作者：Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Chongyu Qu,Juming Xiong,Zhengyi Lu,Yanfan Zhu,Marilyn Lionts,Yuechen Yang,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Inferring spatially resolved, histology images offers, Inferring spatially, spatially resolved gene, resolved gene expression

备注：

点击查看摘要

Abstract:Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at this https URL

160. 【2605.14091】Venus-DeFakerOne: Unified Fake Image Detection Localization

链接：https://arxiv.org/abs/2605.14091

作者：GuangJian Team

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：full-image AIGC synthesis, natural image manipulation, AIGC synthesis, full-image AIGC, Fake Image Detection

备注：

点击查看摘要

Abstract:In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifact transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifact transfer-interference patterns, the necessity of fine-grained supervision, and the preservation of original-resolution artifacts, highlighting the design principles for scalable, robust, and unified FIDL.

161. 【2605.14068】CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

链接：https://arxiv.org/abs/2605.14068

作者：Amirreza Mohseni,Mona Mohammadi,Morteza Saghafian,Naser Talebizadeh Saradari

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：hierarchical topological reasoning, hierarchical topological, introduce CurveBench, pairwise non-intersecting Jordan

备注：

点击查看摘要

Abstract:We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
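To make the target structure concrete, the sketch below builds a containment tree when the curves are already available as polygons. This is an assumed simplification: the benchmark asks models to recover the tree from an image, whereas here the geometry is given and only the nesting is computed. The function names are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the horizontal ray.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def containment_tree(curves):
    """Map each curve index to its parent index (None = unbounded outer region).

    Assumes simple, closed, pairwise non-intersecting curves, so a curve lies
    inside another exactly when any one of its vertices does.
    """
    containers = {
        i: [j for j, outer in enumerate(curves)
            if j != i and point_in_polygon(curve[0], outer)]
        for i, curve in enumerate(curves)
    }
    parents = {}
    for i, outers in containers.items():
        # The direct parent is the innermost container, i.e. the container
        # that is itself nested most deeply.
        parents[i] = max(outers, key=lambda j: len(containers[j])) if outers else None
    return parents

# Toy scene: a big square holding a mid square (which holds a small one)
# and a separate square next to it.
curves = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],   # 0: big
    [(1, 1), (5, 1), (5, 5), (1, 5)],       # 1: mid, inside big
    [(2, 2), (3, 2), (3, 3), (2, 3)],       # 2: small, inside mid
    [(6, 6), (8, 6), (8, 8), (6, 8)],       # 3: other, inside big only
]
tree = containment_tree(curves)
```

Here the result encodes that the big square is the root, the mid and other squares are its children, and the small square is nested one level deeper.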

162. 【2605.14054】Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

链接：https://arxiv.org/abs/2605.14054

作者：Haozhe Wang,Qixin Xu,Changpeng Wang,Taofeng Xue,Chong Peng,Wenhu Chen,Fangzhen Lin

类目：Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：advanced Vision-Language Models, Achieving robust perception-reasoning, Achieving robust, Vision-Language Models, advanced Vision-Language

备注： Accepted by ICML 2026 as Spotlight

点击查看摘要

Abstract:Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often producing a "seesaw effect" between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

163. 【2605.14047】Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

链接：https://arxiv.org/abs/2605.14047

作者：Kieran Carrigg,Sigur de Vries,Amirhossein Sadough,Marcel van Gerven

类目：Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)

关键词：challenging vision tasks, Vision Transformers, vision tasks, challenging vision, devices is severely

备注： 18 pages, 7 figures

点击查看摘要

Abstract:Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not fit every layer's behaviour well and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach eliminates the need to retrain models from scratch entirely. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.

164. 【2605.14045】PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

链接：https://arxiv.org/abs/2605.14045

作者：Wei Dong,Han Zhou,Terry Ji,Guanhua Zhao,Shahab Asoodeh,Yulun Zhang,Guangtao Zhai,Jun Chen,Xiaohong Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Adverse weather removal, overly smooth results, real-world images remains, images remains challenging, remains challenging due

备注： 10 pages, 9 figures, and 4 tables

点击查看摘要

Abstract:Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision-language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at this https URL.

165. 【2605.14031】Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

链接：https://arxiv.org/abs/2605.14031

作者：Wuao Liu,Mustafa Chasmai,Subhransu Maji,Grant Van Horn

类目：Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：distinguish similar-sounding species, Bioacoustic recognition requires, requires fine-grained acoustic, fine-grained acoustic understanding, recognition requires fine-grained

备注： Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures

点击查看摘要

Abstract:Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

166. 【2605.14028】Unified Pix Token And Word Token Generative Language Model

Link: https://arxiv.org/abs/2605.14028

Authors: Haun Leung, ZiNan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformer, model, generative language model, Transformer, vision encoder backbone

Comments: 13 pages, 6 figures

Abstract: Since the emergence of the Vision Transformer (ViT), it has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT trained with the CLIP or SigLIP method serves as the vision encoder backbone that provides visual understanding capabilities. However, this approach limits the understanding of visual details, for example making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a generative language model. The new model also features a separate token embedding for each pix of the image, color folding, global conditional attention approximation, and unsupervised image pretraining. We conducted unsupervised image pretraining experiments with the new model to explore its potential. The experimental results show that it performs well even with a small model and limited training data. We believe our model also conforms to the scaling law: as model parameters and training data increase, its performance will continue to improve.

167. 【2605.13994】CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

Link: https://arxiv.org/abs/2605.13994

Authors: Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: technically challenging task, clinically crucial, crucial yet technically, technically challenging, cine MRI

Abstract:Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

168. 【2605.13974】Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Link: https://arxiv.org/abs/2605.13974

Authors: Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: Diffusion Transformers, remain poorly understood, related flow-based architectures, semantics remain poorly, poorly understood

Comments: Project page: https://aimagelab.github.io/MAs-DiT/

Abstract: Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive activations transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
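
Editor's sketch (not from the paper): the disruption probe described above can be illustrated on a toy hidden-state matrix. The selection statistic used here (mean absolute activation per channel) and the names `massive_channels` and `zero_channels` are assumptions for illustration only; the abstract does not specify how massive channels are ranked.

```python
from statistics import fmean

def massive_channels(hidden, k):
    """Return indices of the k channels with the largest mean |activation|.

    hidden: list of token vectors (tokens x channels).
    The mean-absolute-value statistic is an illustrative assumption.
    """
    n_channels = len(hidden[0])
    score = [fmean([abs(tok[c]) for tok in hidden]) for c in range(n_channels)]
    return sorted(range(n_channels), key=lambda c: score[c], reverse=True)[:k]

def zero_channels(hidden, channels):
    """Disruption probe: copy `hidden` with the chosen channels set to 0."""
    drop = set(channels)
    return [[0.0 if c in drop else v for c, v in enumerate(tok)] for tok in hidden]

# Toy hidden states: 3 tokens x 4 channels; channel 2 is far larger than the rest.
hidden = [
    [0.1, -0.2, 50.0, 0.3],
    [0.0, 0.4, -48.0, -0.1],
    [-0.3, 0.1, 52.0, 0.2],
]
top = massive_channels(hidden, k=1)
probed = zero_channels(hidden, top)
```

In a real DiT the same masking would be applied to intermediate activations during sampling, and the effect would be judged by generation quality rather than by inspecting the tensor.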

169. 【2605.13923】Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations

Link: https://arxiv.org/abs/2605.13923

Authors: Bardh Hoxha, Oliver Schön, Hideki Okamoto, Lars Lindemann, Georgios Fainekos

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)

Keywords: signal temporal logic, past-time signal temporal, study certified runtime, certified runtime monitoring, partial observability

Abstract: We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees, while being reusable: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, we prove that the semantic basis, the vector of atom robustness scores, is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. We also introduce a rolling prediction monitor that predicts only current predicate values and reconstructs temporal history online; this is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, rolling achieves tighter certified bounds at short horizons while the semantic-basis monitor is up to 4 times tighter at long horizons. We validate the presented monitors on real-world Waymo driving data, where both monitors satisfy the conformal coverage guarantee empirically.
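
Editor's sketch (not from the paper): a single conformal calibration pass of the kind mentioned above usually reduces to taking a finite-sample quantile of calibration errors. The code below shows standard split conformal prediction for one scalar score. The nonconformity score (absolute prediction error), the one-sided lower bound, and all numbers are assumptions for illustration; the paper's actual construction over the whole semantic basis is not given in the abstract.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

# Hypothetical calibration errors |predicted - true| for one robustness score.
cal_errors = [0.05, 0.12, 0.08, 0.30, 0.02, 0.15, 0.10, 0.07, 0.20]
q = conformal_quantile(cal_errors, alpha=0.2)

# At run time, a predicted robustness is turned into a conservative lower bound.
predicted_robustness = 0.5
certified_lower_bound = predicted_robustness - q
```

If the certified lower bound stays positive, the monitored property is declared satisfied at the chosen coverage level. Under exchangeability of calibration and test data, the bound holds with probability at least 1 - alpha.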

170. 【2605.13869】Elastic Spiking Transformers for Efficient Gesture Understanding

Link: https://arxiv.org/abs/2605.13869

Authors: Alberto Ancilotto, Gianluca Amprimo, Stefano Di Carlo, Elisabetta Farella

Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: offer energy-efficient processing, event-based sensor data, Elastic Spiking Transformer, offer energy-efficient, healthcare applications

Abstract:Spiking Neural Networks (SNNs), particularly Spiking Transformers, offer energy-efficient processing of event-based sensor data for healthcare applications. Yet current architectures are rigid: they are trained and deployed as static networks with fixed parameter counts and computational graphs. This limits deployment on neuromorphic hardware such as Loihi and SpiNNaker, where on-chip constraints often require smaller models that trade accuracy for feasibility. We introduce the Elastic Spiking Transformer, a runtime-adaptive architecture that brings elasticity into the spiking paradigm. Inspired by Matryoshka-style representation learning, it embeds nested elasticity in the Feature Extractor, Spiking Self-Attention, and Feed-Forward blocks. Through granularity-aware weight sharing, a single universal model can dynamically slice network width and attention heads at inference time without retraining. This design provides two key advantages for SNNs. First, it allows the model to adjust its parameter footprint to different hardware memory budgets. Second, reducing active neurons also lowers spike firing rates, yielding proportional reductions in synaptic operations, an energy benefit not directly available in standard artificial neural networks. We evaluate the approach on CIFAR10/100, CIFAR10-DVS, and the EHWGesture clinical gesture understanding dataset. Results show that one Elastic Spiking Transformer spans a broad range of complexity-accuracy trade-offs, matching or surpassing independently trained baselines while supporting adaptive, real-time gesture recognition on resource-constrained edge devices.
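
Editor's sketch (not from the paper): Matryoshka-style weight sharing, as referenced above, typically means that a narrower sub-network reuses the leading block of the full layer's weights. The example below shows this for a plain dense layer. It is not a spiking layer, and the helper names `slice_linear` and `linear` are hypothetical.

```python
def slice_linear(weight, bias, out_features, in_features):
    """Take the leading sub-block of a dense layer so that smaller
    widths reuse the same underlying parameters."""
    sub_weight = [row[:in_features] for row in weight[:out_features]]
    return sub_weight, bias[:out_features]

def linear(x, weight, bias):
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

# Full 4x4 layer shared by all widths.
W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [1, 1, 1, 1]

# Half-width sub-network selected at inference time, with no retraining.
W_half, b_half = slice_linear(W, b, out_features=2, in_features=2)
y_half = linear([1, 1], W_half, b_half)

# The sliced layer matches the full layer's first outputs on zero-padded input.
y_full = linear([1, 1, 0, 0], W, b)
```

Slicing halves both dimensions here, so the sub-layer stores a quarter of the weights. In a spiking network, fewer active neurons would also mean fewer spikes, which is the energy argument made in the abstract.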

171. 【2605.13862】Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

Link: https://arxiv.org/abs/2605.13862

Authors: Diandian Gu, Jing Lin, Gaohong Liu, Jiahang Liu, Su Ma, Guang Shi, Jun Wang, Qinlong Wang, Qianyi Wu, Zhongcong Xu, Xuanyu Yi, Zihao Yu, Jianfeng Zhang, Zhuolin Zheng, Yifan Zhu, Rui Chen, Hengkai Guo, Xiaoyang Guo, Mingcong Han, Xu Han, Xiu Li, Yixun Liang, Weiqiang Lou, Junzhe Lu, Guan Luo, Minghan Qin, Shuguang Wang, Yuang Wang

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: content generation system, generation system built, application coverage, system built, substantial improvements

Comments: Seed3D 2.0 Technical Report; official page: https://seed.bytedance.com/seed3d_2_0

Abstract:We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0, with substantial improvements across generation fidelity, simulation-ready capabilities, and application coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression and more efficient decoding. For texture and material generation, we replace the cascaded pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic conditioning for improved material precision and visual fidelity. Beyond single-object generation, Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware decomposition, and training-free articulation generation, enabling coherent scene construction and part-level physical interaction across physics and graphics engines. A large-scale human preference study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates of 69.0% to 89.9% in textured 3D asset generation. Seed3D 2.0 is available on this https URL

172. 【2605.13857】MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation

Link: https://arxiv.org/abs/2605.13857

Authors: Dongxia Liu, Jie Ma, Xiaochen Yang, Jiancheng Zhang, Bin Xia, Zhehan Kan, Nisha Huang, Jun Liang, Wenming Yang, Jin Li

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: traditional production workflows, cinematic-quality animal effects, animal effects necessitates, creation of cinematic-quality, effects necessitates

Comments: GitHub page: https://dongxialiu15.github.io/MoZoo/

Abstract:The creation of cinematic-quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE) which employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh-video pairs. Experimental results demonstrate that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.

173. 【2605.13855】SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

Link: https://arxiv.org/abs/2605.13855

Authors: Wentao Yang, Fanzhen Kong, Zejian Kang, Xiangru Huang

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: photorealistic visual appearance, received tremendous popularity, Gaussian Splatting, visual appearance, received tremendous

Abstract: 3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the interdependence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of Gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT-family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering.

174. 【2605.13854】Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Link: https://arxiv.org/abs/2605.13854

Authors: Minghao Sun, Chongyang Xu, Yitao Xie, Buzhen Huang, Kun Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Keywords: real-world interaction analysis, remains challenging due, interaction analysis, pivotal for real-world, real-world interaction

Comments: ICME 2026

Abstract:Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at this https URL.

175. 【2605.13853】FaceParts: Segmentation and Editing of Gaussian Splatting

Link: https://arxiv.org/abs/2605.13853

Authors: Tymoteusz Zapała, Julia Farganus, Dominik Galus, Mikołaj Czachorowski, Piotr Syga, Przemysław Spurek

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual reality, applications in entertainment, Gaussian Splatting avatars, important task, Average Expression Distance

Abstract:Facial editing is an important task with applications in entertainment, virtual reality, and digital avatars. Most existing approaches rely on generative models in the 2D image domain, while in 3D the task is typically performed through labor-intensive manual editing. We propose FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. Unlike existing 2D or mesh-assisted methods, our approach operates directly in the Gaussian domain, decomposing avatars into semantically coherent facial parts without supervision. The method integrates feature disentanglement, density-based clustering, and FLAME-anchored part transfer, enabling precise editing and cross-avatar part swapping. Experiments on the NeRSemble dataset with 11 subjects demonstrate robust isolation of features such as beards, eyebrows, eyes and mustaches. Quantitative evaluation confirms that transferred segments adapt to pose and expression, while maintaining identity consistency (ID = 0.943), low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).

176. 【2605.13852】Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

Link: https://arxiv.org/abs/2605.13852

Authors: Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, Or Litany

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: adhering to precise, precise geometry, aim to generate, control signals, images

Comments: Accepted to CVPR 2026. Project page: https://idosobol.github.io/realiz3d/

Abstract: We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to the domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples controls from the visual domain. The key idea is to explicitly learn the visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.

177. 【2605.14629】Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors

Link: https://arxiv.org/abs/2605.14629

Authors: Julien Zouein, Vibhoothi Vibhoothi, François Pitié, Anil Kokaram

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Radiance Fields, Gaussian Splatting, Radiance Fields, Neural Radiance, offering significant speed-ups

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent framework for real-time, photorealistic scene reconstruction, offering significant speed-ups over Neural Radiance Fields (NeRF). However, the fidelity of 3DGS representations remains heavily dependent on the quality of the initial point cloud. While standard Structure-from-Motion (SfM) pipelines using COLMAP provide adequate initialisation, they often suffer from high computational costs and sparsity in textureless regions, which degrades subsequent reconstruction accuracy and convergence speed. In this work, we introduce an AV1-based feature detection and matching pipeline that significantly reduces SfM processing overhead. By leveraging motion vectors inherent to the AV1 video codec, we bypass computationally expensive exhaustive matching while maintaining geometric robustness. Our pipeline produces substantially denser point clouds, with up to eight times as many points as classical SfM. We demonstrate that this enhanced initialisation directly improves 3DGS performance, yielding a 9-point increase in VMAF and a 63% average reduction in training time required to reach baseline quality. The project page: this https URL

178. 【2605.14123】Keyed Nonlinear Transform: Lightweight Privacy-Enhancing Feature Sharing for Medical Image Analysis

Link: https://arxiv.org/abs/2605.14123

Authors: Haebom Lee, Gyeongjung Kim

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: controlled feature sharing, leak patient identity, patient identity information, Feature sharing, lack practical mechanisms

Abstract:Feature sharing via split inference offers a lightweight alternative to federated learning for resource-constrained hospitals, but transmitted features still leak patient identity information and lack practical mechanisms for controlled feature sharing. We propose Keyed Nonlinear Transform (KNT), a drop-in feature transformation that applies key-conditioned obfuscation to intermediate representations. KNT reduces re-identification AUC from 0.635 to 0.586, corresponding to a 36% reduction in above-chance identity signal, while introducing only 0.15 ms CPU overhead, without backbone retraining, and preserving classification performance within 1.0 pp. Our analysis shows that KNT's nonlinear transform prevents closed-form inversion and shifts recovery to iterative gradient-based optimization under full key compromise, substantially increasing inversion difficulty. The same transform generalizes to dense prediction tasks, incurring only a 4.4 pp Dice reduction on skin-lesion segmentation without retraining. These results position KNT as a practical and efficient privacy layer for split inference deployments.
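
Editor's note: the "36% reduction in above-chance identity signal" follows from measuring AUC relative to the chance level of 0.5 rather than from the raw AUC drop. A minimal check of that arithmetic, assuming chance AUC is 0.5 (the function name is illustrative):

```python
def above_chance_reduction(auc_before, auc_after, chance=0.5):
    """Relative drop in the AUC margin above chance level."""
    return (auc_before - auc_after) / (auc_before - chance)

# Values quoted in the abstract: re-identification AUC 0.635 -> 0.586.
reduction = above_chance_reduction(0.635, 0.586)
print(f"{reduction:.1%}")
```

The margin above chance shrinks from 0.135 to 0.086, a relative drop of about 36%, whereas the raw AUC falls by only about 8%.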

179. 【2605.13910】Covariance-aware sampling for Diffusion Models

Link: https://arxiv.org/abs/2605.13910

Authors: Andrea Schioppa, Tim Salimans

Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: pixel-space Diffusion Model, pixel-space Diffusion, Diffusion Model, few-step regime samplers, few-step regime

Abstract:We present a covariance-aware sampler that improves the quality of pixel-space Diffusion Model (DM) sampling in the few-step regime. We hypothesize that in the few-step regime samplers fail because they rely solely on the predicted mean of the reverse distribution, while our solution explicitly models the reverse-process covariance. Our method combines Tweedie's formula to estimate the covariance with an efficient, structured Fourier-space decomposition of the covariance matrix. Implemented as an extension of DDIM, our method requires only a minimal overhead: one extra Jacobian-Vector Product (JVP) per step. We demonstrate that for pixel-based DMs, our method consistently produces superior samples compared to state-of-the-art second order samplers (Heun, DPM-Solver++) and the recent aDDIM sampler, at an identical number of function evaluations (NFE).
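
Editor's sketch (not from the paper): for a noising model x_t = alpha * x_0 + sigma * eps, the second-order Tweedie formula gives Cov[x_0 | x_t] = (sigma^2 / alpha^2) * (I + sigma^2 * Hessian of log p_t(x_t)). Equivalently, the covariance is (sigma^2 / alpha) times the Jacobian of the posterior mean, which is consistent with needing one JVP per step for a covariance-vector product. The notation (alpha, sigma) is an assumption, since the abstract does not state the paper's parameterization. The check below compares the formula against exact Gaussian conditioning in one dimension.

```python
def posterior_var_exact(s2, alpha, sigma2):
    """Var[x0 | xt] for x0 ~ N(0, s2) and xt = alpha * x0 + N(0, sigma2),
    obtained by standard Gaussian conditioning."""
    return s2 * sigma2 / (alpha**2 * s2 + sigma2)

def posterior_var_tweedie(s2, alpha, sigma2):
    """Same quantity from the second-order Tweedie formula."""
    marginal_var = alpha**2 * s2 + sigma2   # xt ~ N(0, marginal_var)
    hessian_log_p = -1.0 / marginal_var     # second derivative of log p_t(xt)
    return (sigma2 / alpha**2) * (1.0 + sigma2 * hessian_log_p)

# Hypothetical values: prior variance 4, signal scale 0.8, noise variance 0.36.
exact = posterior_var_exact(4.0, 0.8, 0.36)
tweedie = posterior_var_tweedie(4.0, 0.8, 0.36)
```

The two values agree up to floating-point error. For image data the Hessian is not available in closed form, which is where a structured approximation such as the Fourier-space decomposition mentioned in the abstract would come in.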

180. 【2605.13889】Physics-Grounded Adversarial Stain Augmentation with Calibrated Coverage Guarantees

Link: https://arxiv.org/abs/2605.13889

Authors: Mingi Hong

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: hospitals degrades histopathology, degrades histopathology models, models at deployment, variation across hospitals

Abstract: Stain variation across hospitals degrades histopathology models at deployment. Existing augmentation methods perturb color spaces with arbitrary hyperparameters, lacking both a principled budget and coverage guarantees for unseen centers. We propose Calibrated Adversarial Stain Augmentation (CASA), which performs adversarial augmentation in the Macenko stain parameter space with a budget calibrated from multi-center statistics via the DKW inequality. On Camelyon17-WILDS (5 seeds), CASA achieves 93.9% ± 1.6% slide-level accuracy, outperforming HED-strong (88.4% ± 7.3%), RandStainNA (85.2% ± 6.7%), and ERM (63.9% ± 11.3%), with the highest worst-group accuracy (84.9% ± 0.9%) among all 10 compared methods.
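
Editor's sketch (not from the paper): the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality bounds the gap between an empirical CDF F_n and the true CDF F by P(sup |F_n - F| > eps) <= 2 * exp(-2 * n * eps^2). One plausible way to turn it into a calibrated budget is to inflate the target quantile level by eps. How CASA actually applies the inequality is not stated in the abstract, so the function `calibrated_budget` and the sample values are hypothetical.

```python
import math

def dkw_epsilon(n, delta):
    """Half-width eps such that sup |F_n - F| <= eps with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def calibrated_budget(samples, coverage, delta):
    """Smallest observed value whose empirical CDF is >= coverage + eps.

    With probability >= 1 - delta, at least `coverage` of the true
    distribution lies below the returned value.
    """
    n = len(samples)
    level = coverage + dkw_epsilon(n, delta)
    if level > 1.0:
        return math.inf  # too few samples to certify this coverage
    rank = math.ceil(level * n)  # 1-indexed order statistic
    return sorted(samples)[rank - 1]

# Hypothetical stain-shift magnitudes observed across training centers.
shifts = [i / 200 for i in range(1, 201)]
eps = dkw_epsilon(len(shifts), delta=0.05)
budget = calibrated_budget(shifts, coverage=0.8, delta=0.05)
```

With 200 samples and delta = 0.05, eps is roughly 0.096, so covering 80% of the true shift distribution requires the empirical quantile at a level of roughly 0.896.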